ARE Algebraic Riccati Equation
quadratic program of the residual-based method

Input matrix 

Real constant 

Jacobi matrix in MaxEnt IRL 

System dynamics 

Closed-loop system matrix 

Running costs in a cost function 

Gravitational constant 

First derivative of J w.r.t. controls in MaxEnt IRL 

Control matrix in a control-affine nonlinear system 

Second derivative of J w.r.t. controls in MaxEnt IRL 

Terminal costs in a cost function 
Discrete time step
Feedback matrix (linear feedback strategy)
Dimension of control vector

Dimension of the cost function parameter vector / Torque applied by players in ball-on-beam
example

Matrix for the solution of an ILQDG with a feedback information structure
Vector of cost function parameters
Symbol Description
Time derivative 
uv Moore-Penrose inverse 
A Variable corresponding to a Nash Equilibrium 
b Variable corresponding to the ball of a ball-on-beam system 
D Variable corresponding to a discrete-time (infinite) dynamic 
game 
FB Variable corresponding to a feedback information pattern or 
strategy 
; Variable corresponding to player i 
Ir: Variables corresponding to all players except player i 
(j) j-th entry of a vector
(rc) Entry of a matrix in the r-th row and the c-th column
(k) Variable corresponding to the stage k in a discrete-time dynamic game
(s) Variable corresponding to the subject pair s in the experimental results
max Maximum value 
ke Mean value 
median Median value 
OL Variable corresponding to an open-loop information pattern or strategy
P Variable corresponding to a Pareto efficient solution
p Variable corresponding to a cost function for Pareto efficient solutions
2 Variable corresponding to a Stackelberg Equilibrium 
SD Standard deviation 
W Variable corresponding to the beam of a ball-on-beam system 
θ Variable associated with parameters θ
š Measured or observed variable 
f Estimated variable 
Sequence of variables 


1 Introduction 


Automatic and intelligent machines have become ever-present in today’s society. Originally
developed for industrial environments to perform repetitive tasks on their own and out of
human reach, the robots and automation systems of today interact closely with humans and
with other robotic systems. Current trends in technological development entail an even
closer interaction, for instance at a haptic level. This means that machines physically
interact with a cooperation partner, e.g. a human, in order to assist them in completing
various tasks more efficiently and safely. Such close interaction occurs in cooperative
industrial robotics, robot-assisted surgery, assistance systems for vehicle control and
various other human-machine cooperation settings. Therefore, automated robotic systems
increasingly need the ability to predict the behavior of the humans or previously unknown
machines that may interact with them. This ability is crucial for the design of such
cooperative systems and for exploiting the full potential of cooperation synergies. Hence,
adequate modeling and identification methods are essential: such mathematical models and
suitable identification approaches can lead to a better general understanding of interacting
agents and make it possible to implement model-based control algorithms in a technical
device so that it behaves adequately during interaction with, e.g., a human partner.


The aforementioned situation demands a modeling framework which, on the one hand, serves 
as a mathematical approach for the design of the automatic controller, but on the other hand, 
allows the description of human behavior. Descriptive and biologically interpretable models 
for human behavior have been explored in the biological and neuroscientific communities. In
particular, motor control of humans has been conjectured to arise from minimum principles 
[NC61]. Several optimality principles have been proposed to explain the generation of a spe- 
cific trajectory which serves as a command to lower-level biomechanical models (see [Eng01] 
for an extensive review). Given these optimality criteria, optimal control theory arises nat- 
urally as a model for movement planning and generation [Tod04] and has become a widely 
accepted approach in the neuroscience community. This led to further work which used this 
approach to model not only different kinds of human movement [MTL10, EHAAM16], but 
also the behavior of a human controlling a dynamic system [PCC*15]. The theory of opti- 
mal control itself is one of the most applied concepts in automatic control with numerous 
applications in engineering. Using this concept, an automatic controller can be described by 
a particular cost function as this leads to a control law which determines its behavior. 


(a) Agents interacting with each other (b) Agents interacting through a dynamic system


Figure 1.1: Different scenarios of interaction between several agents


In the general case with humans and machines interacting and cooperating with each other,
either in terms of self-positioning (e.g. avoiding collision) or through the control of a
dynamic system (e.g. haptic shared control of a vehicle), as depicted in Figure 1.1, a possibly
conflicting situation emerges. This is because human and machine each strive to optimize
their own individual criterion, thus potentially affecting each other negatively. Conflicts in
dynamic situations, such as those arising in engineering problems, can be described by
dynamic game theory, a framework which has been increasingly employed for applications in
automatic control [Isa99, RBS16] as well as economics [Doc00] and biology [MGP*18]. In other
words, the mathematical framework of dynamic game theory not only includes modeling the
behavior of each partner by means of a criterion to be optimized, but also allows for the
analysis of the result of their interaction. This result is typically described by an
equilibrium solution, the computation of which has been the object of considerable efforts
(cf. [BO99]). In addition, first studies exist which demonstrate the potential of the
so-called Nash equilibrium as a descriptive concept for biological systems, for instance, bird
collision avoidance behavior [MGP*18] as well as interacting humans in avoidance behavior
[TW19] and in haptically coupled scenarios [BOW09, CS17, IFH19].


However, calculating equilibrium solutions in dynamic games demands the knowledge of the 
criteria each of the players optimize, which in real scenarios are typically unknown. Indeed, 
intelligent automated systems will usually have incomplete information about other play- 
ers. Moreover, in human-machine interaction, the objective function of the human partner 
is usually unknown. For instance, in highly automated driving scenarios, an autonomous 
driving car would not have knowledge of the objectives of other non-autonomous (human- 
controlled) vehicles. In these cases, if only measurement data is available, the objectives of 
the players have to be identified out of a given outcome of the interaction, i.e. players’ actions 
and system states corresponding to a game-theoretic equilibrium. For dynamic game theory to
achieve a major breakthrough in applications, efficient data-based identification of the
criteria each of the players optimizes becomes essential. This identification problem is
referred to as an inverse dynamic game, and its solution is the main research objective of
this thesis.


1.1 Research Objective and Contributions


The main objective of this thesis is the development of methods for solving inverse dy- 
namic game problems, allowing for an estimation of player objectives from observed inter- 
action behavior. Contrary to the problem of determining equilibrium solutions from known 
objectives which has been extensively studied, the inverse problem has scarcely been con- 
sidered in previous work. Most treatments consider special cases, propose computationally 
heavy methods and do not give further insight into the properties of the problem. Motivated
by the aforementioned studies on human-human interaction, this thesis focuses on dynamic
games where a Nash equilibrium arises and defines the observed behavior. In addition,
the methods are designed to be efficient in view of their use in real applications.


In a broad sense, the following contributions are made and presented in this thesis: 


1. The development of efficient control-theoretical methods for inverse dynamic games as 
a means to identify cost functions of interacting players based on given observations. 
Furthermore, mathematical conditions for successful identification are developed. 


2. The development of an inverse dynamic game method based on an approach which 
stems from computer science and information theory, for which a proof of the unbi- 
asedness of the objective estimation is given. 


3. The application of the novel methods using both simulated data from different sce- 
narios and real data from a cooperative steering experiment with 52 participants. The 
performance of the developed methods and a state-of-the-art approach is compared 
and thoroughly analyzed. 


1.2 Outline 


The remainder of this thesis is structured as follows. 


In Chapter 2, related work and existing literature on the estimation of player objectives in 
optimal control and dynamic games are reviewed. The research gap is formalized in terms of 
concrete research questions which shall be answered in this thesis. Chapter 3 introduces the 
reader to the necessary mathematical fundamentals of dynamic game theory. In particular, 
existing results on the determination of equilibrium solutions are reviewed which lay the 
foundation of the developed inverse dynamic methods of this thesis. 


The main theoretical contributions are given in Chapters 4 to 6. Chapter 4 gives a formal
definition of inverse dynamic game problems and presents a control-theoretical approach for
open-loop inverse dynamic games. Furthermore, sufficient conditions for successful
identification of unique parameters are presented. Inverse methods and analysis tools for the
class of linear-quadratic (LQ) differential games are presented in Chapter 5; necessary and
sufficient conditions for identification of unique parameters are also given. In Chapter 6, a
method based on inverse reinforcement learning is presented and shown to be adequate for
solving inverse dynamic games with both open-loop and feedback information structures.
The chapter also presents unbiasedness results for the estimation of player objectives with
this approach.


The next chapters evaluate the novel methods in simulations and a real application. First,
Chapter 7 gives simulation results to evaluate all presented methods. The
properties of each class of method are highlighted and a systematic comparison with a state- 
of-the-art method is conducted where the quality of the identification, robustness to mea- 
surement noise and modeling errors as well as the computational complexity are evaluated. 
Chapter 8 shows an application of inverse dynamic games including the identification of 
human behavior in a haptic shared control task. Similar to Chapter 7, the experimental data 
is used to compare the methods with respect to the capability of describing observed human 
cooperative steering behavior. 


Finally, Chapter 9 sums up all insights and results obtained in this thesis. 


The structure of the thesis is summarized in Figure 1.2, where the main body is divided into 
two paths to stress the different principles which underlie the proposed inverse dynamic 
game methods. 


[Figure 1.2 (flowchart): Chapter 1, Introduction; Chapter 2, Related Work and Research Gap;
Chapter 3, Fundamentals of Dynamic Game Theory; then two parallel paths, one with Chapter 4,
Inverse Differential Games, and Chapter 5, Inverse LQ Differential Games, the other with
Chapter 6, Inverse Dynamic Games Based on Inverse Reinforcement Learning; followed by
Chapter 7, Simulations; Chapter 8, Application to Shared Control Systems; Chapter 9,
Conclusion.]


Figure 1.2: Outline of the thesis


2 Related Work and Research Gap 


In this chapter, related work concerning methods for the estimation of cost functions is re- 
viewed and the concrete research gap is identified. The majority of related work is concerned 
with cost function identification in a single-player case, also known as inverse optimal con- 
trol, both from a control-theoretic and from a computer science point of view. Therefore, this 
case and its origins are surveyed first to provide an adequate context before covering state- 
of-the-art methods in a game-theoretical setting. The chapter ends with a discussion on all 
explored literature, the statement of the research gap and corresponding research questions 
to be answered in this thesis. 


2.1 The Inverse Problem of Optimal Control 


The problem of characterizing and describing cost functions corresponding to known optimal 
solutions was first considered in an optimal control setting, a problem which is known as in- 
verse optimal control. The study of inverse problems in optimal control started with Kalman’s 
paper: "When is a linear control system optimal?". The paper introduced conditions for a 
given linear control law to be optimal with respect to a quadratic performance index in the 
case of a single-input linear system and also showed that the inverse problem is ill-posed 
[Kal64]. Further progress was made by [Tha67] and [MA73] which stated similar conditions 
for a control-affine system and more general performance indices. These conditions serve the 
characterization of control laws which are optimal, but are not computationally convenient 
in order to calculate a particular cost function. The computational aspect was addressed in 
[JK73], where formulas were given for calculating a particular set of cost function matrices 
based on the known system dynamics and control law. Generalized results were given by 
[Cas80], where the Hamilton-Jacobi-Bellman equation was proposed as a means to calculate 
all possible cost function parameters corresponding to a known control feedback law in a 
linear-quadratic optimal control problem. Similarly, [FN84] extended Kalman’s results to the
multivariable case and dropped the assumption of a stabilizing control law.


After these initial efforts, inverse optimal control as a means to determine cost functions
receded into the background in favor of the development of control synthesis methods. The
newly introduced objective of inverse optimal control consisted in the calculation of a control
law which is optimal with respect to some (not necessarily prespecified) cost function, a
property which is desirable due to the resulting robust stability of the closed-loop system.
An approach was developed in [Fuj87] for the linear-quadratic case. Later, [FK96, KT99]
developed an approach for input-affine nonlinear systems. To this end, a link between optimal
value functions and Control Lyapunov Functions was established using Sontag’s control law
[Son89].


Nori and Frezza were the first in the automatic-control community to state a problem which 
consisted of finding a cost function which explains measured trajectories [NF04], represent- 
ing a contrast to the first theoretical work and the subsequent approaches focusing on control 
synthesis. Hence, the "inverse optimal control problem" underwent a shift towards a more 
application-oriented problem. Most following approaches which can be found under the 
name of "inverse optimal control" build upon this idea and define the problem as follows: 


Definition 2.1 (Inverse Optimal Control Problem) 

Let observed state trajectories x*(t) of a known dynamic system and control trajectories u*(t) 
of a controller be given. Determine the cost function J under which the observed trajectories 
are optimal. 


Definition 2.1 assumes the optimality of the observed trajectories, thus intuitively
representing the inverse problem to the classical optimal control problem¹ (illustrated in
Figure 2.1). Nevertheless, this assumption is sometimes dropped (as e.g. in [NF04]), in which
case the problem consists of estimating a cost function which best approximates a given set of
trajectories.


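[Figure 2.1 (diagram): optimal control maps a cost function J(x(t), u(t)) to the optimal
trajectories x*(t), u*(t); inverse optimal control maps the trajectories back to J.]


Figure 2.1: Graphical description of the inverse optimal control problem.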


1 In the course of this thesis, the latter problem shall also be referred to as the forward
problem to stress the contrast to the introduced inverse problem.


Inverse optimal control has been an object of research in the last decades, both from a
theoretical and a practical point of view. The variety of methods for solving inverse optimal
control problems can be classified into three main groups:


1. Direct approaches
2. Inverse Optimal Control (IOC) methods which apply control-theoretical principles
3. Inverse Reinforcement Learning (IRL) techniques which stem from computer science


It must be noted that the classification varies in literature. Indeed, a variety of articles use
"inverse optimal control" to denote the problem of estimating cost functions from measured
data, similar to [NF04] and independently of the applied method. Nevertheless, this
classification is adopted in this thesis and delineated in the following. Almost all articles
found in literature present approaches which assume a particular structure of the cost
function, e.g. a quadratic cost function. Therefore, the problem of identifying a cost function
is reduced to determining parameters θ such that the observed state and control trajectories
are optimal with respect to the resulting cost function J(θ).


The presented method classes are further described in the following. 


2.1.1 Direct Approaches 


One of the most common ways to solve the inverse optimal control problem is a direct
approach, where the cost function is determined iteratively. In each iteration, an optimal
control problem is solved in order to generate the trajectories which are optimal with respect
to the current cost function candidate. These trajectories are then compared to the observed
ones. Based on this comparison, which usually includes the calculation of an error measure
between trajectories, the cost function is updated such that the error is reduced. The
overall aim of the method is to determine cost function parameters such that the error
between both sets of trajectories is minimized. Since the solution of the optimal control
problem in each iteration can be regarded as a "lower" level of the main optimization
problem, these kinds of methods are also known as bilevel methods [MTL10]. Figure 2.2 shows
a schematic diagram of both levels of the direct approach: the upper level, where the cost
function of the current iteration κ is updated such that a performance measure, e.g. the error
between trajectories, is minimized, and the lower level, where an optimal control problem is
solved to determine trajectories which are optimal with respect to the current cost function
candidate.


The first algorithm of this kind was presented in [MTL10] and applied to human locomotion
modeling. Further applications of this approach include driver steering behavior modeling
[MFH17], reach-to-grasp human motion [EHAAM16] and human leg movements [BPC*06].
The implementations of the methods usually differ in the techniques for solving the upper
level problem. For example, in [EHAAM16], the upper level problem is solved by means of
particle swarm optimization. In [BPC*06], a static optimization version of the problem is
posed and solved by nonlinear programming techniques. All methods require the repeated
solution of optimal control (or static optimization) problems in the lower level and therefore
potentially yield large computation times. For this reason, the importance of efficient
numerical techniques for the solution of the problems in both levels is often stressed in
literature (see e.g. [MTL10]). As a way of mitigating the computational effort, [ARARU*11],
[HSB12] and, very recently, [ZLH19] replace the lower level problem by its corresponding
optimality conditions. As a consequence of the high computation times, the methods are
mostly suitable for offline applications only.
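

The two levels of the direct approach can be illustrated with a minimal sketch. It is not taken
from any of the cited works: it assumes a scalar discrete-time linear system, a quadratic cost
with the control weight fixed to r = 1 (to remove the scaling ambiguity) and noise-free
observations. All function names are hypothetical. The lower level solves the forward problem
via a Riccati iteration; the upper level uses a golden-section search over the state weight q.

```python
def riccati_gain(a, b, q, r=1.0, iters=500):
    """Lower level: solve the scalar infinite-horizon LQ problem by iterating
    the discrete-time Riccati equation; returns the gain k of u = -k * x."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)


def simulate(a, b, k, x0, steps):
    """Closed-loop state trajectory for the control law u = -k * x."""
    xs = [x0]
    for _ in range(steps):
        xs.append((a - b * k) * xs[-1])
    return xs


def trajectory_error(q, a, b, x_obs):
    """Error measure: sum of squared differences between the trajectory that
    is optimal for the candidate q and the observed trajectory."""
    k = riccati_gain(a, b, q)
    x_opt = simulate(a, b, k, x_obs[0], len(x_obs) - 1)
    return sum((xo - xm) ** 2 for xo, xm in zip(x_opt, x_obs))


def bilevel_estimate(a, b, x_obs, lo=1e-3, hi=100.0, tol=1e-8):
    """Upper level: golden-section search over the state weight q.
    Every error evaluation triggers one lower-level solve."""
    g = (5 ** 0.5 - 1) / 2
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if trajectory_error(c, a, b, x_obs) < trajectory_error(d, a, b, x_obs):
            hi = d
        else:
            lo = c
        c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    return 0.5 * (lo + hi)


# Synthetic "observed" data generated with a known weight, then re-identified.
a, b, q_true = 0.9, 0.5, 2.0
x_obs = simulate(a, b, riccati_gain(a, b, q_true), x0=1.0, steps=30)
q_hat = bilevel_estimate(a, b, x_obs)
print(q_hat)
```

Even in this scalar setting, each upper-level step requires two forward solutions, which
mirrors the computational burden discussed above.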


[Figure 2.2 (diagram): upper level "Compare optimal and observed trajectories", which updates
the cost function candidate from J^(κ) to J^(κ+1); lower level "Determine optimal trajectories
for J^(κ)".]


Figure 2.2: Direct bilevel approach for inverse optimal control: The upper level updates the cost function candidate
such that an error measure is minimized. The lower level solves an optimal control problem.


2.1.2 Inverse Optimal Control 


This class of methods exploits results from optimal control theory and does not rely on the
repeated solution of an optimal control problem. The methods are based on the assumption
that the observed trajectories are optimal with respect to an (unknown) cost function.
Under this assumption, optimality conditions are exploited in order to develop computational
methods to find the parameters of the cost function which explains the observed data. The
parameters are determined by minimizing an objective function (usually called residual
function) which describes the extent to which the optimality conditions are violated.
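

As a minimal sketch of the residual idea, consider a scalar discrete-time system
x_{k+1} = a x_k + b u_k with cost sum_k (q x_k^2 + r u_k^2) + q x_N^2. This setup, the
normalization r = 1 and all names below are assumptions made for illustration and do not
reproduce a specific cited method. The discrete-time minimum principle gives the costate
recursion lam_k = 2 q x_k + a lam_{k+1} with lam_N = 2 q x_N and the stationarity condition
2 r u_k + b lam_{k+1} = 0. Since the costate is linear in q, the residual of the stationarity
condition is linear in q as well, and q follows from a single least-squares step without
solving any forward problem.

```python
def optimal_trajectory(a, b, q, x0, horizon, r=1.0):
    """Create synthetic 'observed' data: optimal finite-horizon LQ trajectory
    from the backward Riccati recursion with terminal weight P_N = q."""
    p = q
    gains = []
    for _ in range(horizon):
        k = a * b * p / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()  # gains[k] now belongs to stage k
    xs, us = [x0], []
    for k in gains:
        us.append(-k * xs[-1])
        xs.append(a * xs[-1] + b * us[-1])
    return xs, us


def estimate_state_weight(a, b, xs, us):
    """Residual-based estimate of q with r fixed to 1.
    Costate: lam_k = q * s_k, where s_N = 2 x_N and s_k = 2 x_k + a s_{k+1}.
    Residual of the stationarity condition: 2 u_k + b q s_{k+1}."""
    n = len(us)
    s = [0.0] * (n + 1)
    s[n] = 2.0 * xs[n]
    for k in range(n - 1, -1, -1):
        s[k] = 2.0 * xs[k] + a * s[k + 1]
    # Closed-form minimizer of sum_k (2 u_k + q b s_{k+1})^2 over q.
    num = sum(-2.0 * us[k] * b * s[k + 1] for k in range(n))
    den = sum((b * s[k + 1]) ** 2 for k in range(n))
    return num / den


xs, us = optimal_trajectory(a=0.9, b=0.5, q=2.0, x0=1.0, horizon=20)
q_hat = estimate_state_weight(0.9, 0.5, xs, us)
print(q_hat)
```

In contrast to a direct approach, the estimate here is obtained in one pass over the data,
which illustrates why residual-based methods tend to be computationally cheaper.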


The variety of methods arises from the different kinds of optimality conditions which have
been applied. In the continuous-time case, these include Pontryagin’s minimum principle² and
the resulting Hamilton differential equations [JAB13], the Euler-Lagrange equations [AB14]
and the Hamilton-Jacobi-Bellman equation [PHL14]. If time is discretized, then
Karush-Kuhn-Tucker (KKT) conditions [KWB11, PJJB12, PR15, PR17] or the discrete-time
minimum principle [MTFP16] can be applied.


2 This principle was originally posed in 1955 as a maximum principle given the aim of
maximizing an objective function (cf. [Gam99]).


Some work has focused on the case where the dynamic system is linear and the cost function
structure is quadratic, i.e. an inverse linear-quadratic optimal control problem. This
formulation allows for exploiting the constant linear feedback matrix which arises if the time
horizon tends towards infinity. If this matrix is known, then the cost function parameters can
be estimated by solving a linear matrix inequality [Boy94, Section 10.6] or by stating an
alternative objective function to be minimized with the algebraic Riccati equations as
constraints [PCC*15, FMM*18].
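

The role of the constant feedback gain can be illustrated for a scalar continuous-time system
dx/dt = a x + b u with cost integrand q x^2 + r u^2. The following is a hypothetical sketch
under these assumptions, not the LMI or constrained formulations of the cited works. The
algebraic Riccati equation 2 a p - b^2 p^2 / r + q = 0 and the gain k = b p / r can be
inverted in closed form once r is fixed. The sketch also shows the ill-posedness mentioned
earlier: scaling q and r by the same factor leaves the gain unchanged.

```python
import math


def lq_gain(a, b, q, r):
    """Forward problem: stabilizing solution p of the scalar ARE
    2*a*p - b**2 * p**2 / r + q = 0 and the resulting gain k = b*p/r."""
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    return b * p / r


def state_weight_from_gain(a, b, k, r=1.0):
    """Inverse problem: recover q from a known gain k for a fixed r by
    inserting p = r*k/b into the ARE."""
    p = r * k / b
    return b * b * p * p / r - 2.0 * a * p


k = lq_gain(a=-1.0, b=2.0, q=3.0, r=1.0)
print(state_weight_from_gain(-1.0, 2.0, k))          # q for the normalization r = 1
print(state_weight_from_gain(-1.0, 2.0, k, r=10.0))  # a scaled, equally valid solution
```

The two printed values differ by the same factor as the two choices of r, so the gain alone
determines the cost function only up to a common scaling.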


2.1.3 Inverse Reinforcement Learning 


Finally, related problems have been tackled in the field of computer science, for which
so-called inverse reinforcement learning (IRL) techniques have been developed. The IRL
problem itself was first introduced by Russell and Ng [Rus98, NR00]. IRL mostly considers a
discrete-time Markov Decision Process (MDP), which implies a finite and discrete set of
possible control³ values and states, and searches for a reward function instead of a cost
function.⁴ An example scenario (depicted in Figure 2.3) which can be modeled with an MDP is a
grid world.⁵ The inverse problem consists in finding the cost function if the agent’s
trajectory from the initial state to the final state, or the optimal strategy, is known.
Furthermore, in IRL problems, the strategies and the dynamics of the system are potentially
stochastic.

[Figure 2.3 (diagram): grid world with the desired final state marked E.]


Figure 2.3: Grid world scenario in reinforcement learning, where the aim is to find an optimal policy which leads
to the desired final state (E).


3 In the IRL literature, the controls are known as actions. In this thesis, both names are used
as synonyms.
4 Minimization of a cost function corresponds to a maximization of the reward function. The
maximization problem can be easily cast as a minimization problem by multiplying the reward
function with -1. Therefore, in the following, the term "cost function" will be used without
loss of generality.
5 A grid world is the most common test scenario for (inverse) reinforcement learning methods.
It describes an agent searching for an optimal strategy which allows it to reach the final
state with least cost. In Figure 2.3, this implies avoiding the red blocks which denote a high
cost.


There is a vast number of methods which tackle the IRL problem using different principles.
Interestingly, many of the methods under the name of IRL which are available in literature
are based on a repeated calculation of the control and state sequences based on the current
reward function candidate, i.e. the solution of the forward problem. Therefore, the
principle is very similar to the aforementioned bilevel method. The methods presented in
[AN04, RBZ06, NS07] are mentioned as examples. The Bayesian IRL method of [RA07] uses
maximum a posteriori estimation of the cost function which depends on sampling methods
and thus demands the repeated estimation of optimal controls. A widespread IRL approach
was proposed by Ziebart et al. [ZMBD08]. The idea consists in applying the principle of
maximum entropy introduced by Jaynes [Jay57] in order to find a least-biased probability
distribution which explains the observed trajectories.
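

The maximum entropy idea can be sketched in its simplest form. The example below assumes a
small, explicitly enumerated set of candidate trajectories, each summarized by a feature
vector f, and a cost that is linear in the parameters, so that p(trajectory) is proportional
to exp(-theta . f). This is a simplification for illustration (the candidate set, the features
and all names are hypothetical) and not the MDP-based algorithm of [ZMBD08]. Maximizing the
likelihood of the demonstrations by gradient ascent leads to parameters for which the expected
features under the model match the empirical features of the demonstrations.

```python
import math


def maxent_probabilities(theta, features):
    """Maximum entropy distribution over a finite candidate set:
    p(tau) proportional to exp(-theta . f(tau))."""
    scores = [-sum(t * f for t, f in zip(theta, feat)) for feat in features]
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def expected_features(probs, features):
    """Feature expectation under the given probabilities."""
    dim = len(features[0])
    return [sum(p * feat[j] for p, feat in zip(probs, features)) for j in range(dim)]


def fit_maxent(features, demo_indices, lr=0.5, iters=5000):
    """Gradient ascent on the average log-likelihood of the demonstrations.
    The gradient w.r.t. theta is (model feature expectation) minus
    (empirical feature mean) for the convention p ~ exp(-theta . f)."""
    n = len(features)
    demo_probs = [demo_indices.count(i) / len(demo_indices) for i in range(n)]
    f_demo = expected_features(demo_probs, features)
    theta = [0.0] * len(features[0])
    for _ in range(iters):
        f_model = expected_features(maxent_probabilities(theta, features), features)
        theta = [t + lr * (fm - fd) for t, fm, fd in zip(theta, f_model, f_demo)]
    return theta, f_demo


# Four candidate trajectories with two features each; ten demonstrations.
features = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
demos = [0, 0, 0, 0, 0, 1, 1, 1, 2, 3]
theta, f_demo = fit_maxent(features, demos)
f_model = expected_features(maxent_probabilities(theta, features), features)
print(theta, f_model, f_demo)
```

For continuous-valued state and action spaces the candidate set cannot be enumerated, which is
where the normalization term becomes the main computational difficulty.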


All of the aforementioned IRL methods consider an MDP as a basis and are therefore limited 
to discrete-valued and finite states and actions. For large (or even infinite) state and action
spaces, these methods suffer from the curse of dimensionality and become highly complex 
and computationally heavy, especially if they are applied to approximate continuous-valued 
state and action spaces. Therefore, some effort has been made to develop IRL techniques 
for continuous-valued spaces, tackling in this way a very similar problem as the literature 
on cost function identification in a control-theoretical setting. It is conspicuous that these 
approaches show a strong similarity to the maximum entropy IRL method of [ZMBD08]. 
For example, [AB11] and [HFKB15] apply a maximum entropy distribution, yet solve the 
IRL problem using a bilevel structure. On the other hand, [KPRS13] and [LK12] propose a 
maximum entropy distribution which considers continuous-valued state and action spaces 
and does not rely on the repeated solution of optimal control problems. 


2.2 Inverse Problems in Game Theory 


After reviewing literature on cost function identification in a single-player case, this section 
investigates the extent to which similar problems have been tackled in a game-theoretical 
scenario, i.e. the identification of cost functions from the observed interaction between sev- 
eral players. 


Inverse problems in game theory have received growing attention in recent years, especially
for static games. The term inverse game theory was introduced in [SC12] to denote the esti- 
mation of the actions and cost functions of the adversary, i.e. the other players in the game, 
in order to obtain better results. Similar work is reviewed in the following. 


2.2.1 Inverse Static Games 


Even though the concept of inverse game theory initially consisted in estimating adversary
cost functions from the point of view of a particular player, its meaning quickly became
more general and hence, it gained a strong similarity to the previously introduced inverse
optimal control problems. Kuleshov and Schrijvers [KS15] introduce their paper with the
words: "given the observed behavior of players in a game, how can we infer the utilities⁶ that
led to this behavior?". They consider parametrizable Bayesian games where players have
incomplete information of the opponents’ cost functions. These are estimated by using data of
several realizations of static games. Similar conditions are needed in the approach of
Konstantakopoulos et al. [KRJ*18], which leverages necessary and sufficient conditions of each
player’s cost function to estimate their parameters. In [BGP15], a method based on the
solution of variational inequalities is presented to identify cost functions. An application of
this work for the optimization of transportation networks is presented in [ZPCP17].


2.2.2 Inverse Dynamic Games 


Transferring the problem of Definition 2.1 to a multiplayer (N-player) case leads to the con- 
cept of inverse dynamic games. A general inverse dynamic game may be defined as follows: 


Definition 2.2 (General Inverse Dynamic Game) 

Let state trajectories x*(t) of a known dynamic system and control trajectories u_i*(t) of each
player i, i ∈ {1, ..., N}, which correspond to a solution of a dynamic game be given. Find
the cost functions J_i, for each player i, which generated the trajectories.


In Definition 2.2, the trajectories are generated by several players in a dynamic game acting
based on individual cost functions. In addition, the problem is also ill-posed, an evident fact
given the ill-posedness of the single-player case. The problem of Definition 2.2 is described
as "general" in the sense that the solution type is still unspecified and, contrary to the
single-player case, different solution concepts exist which generally lead to different
trajectories. If the game is non-cooperative, the solution may be a Nash or a Stackelberg
equilibrium depending on the order in which the players act. If the game is cooperative, then
usually a Pareto efficient solution is assumed [ER11]. Literature on dynamic game theory
mostly focuses on the concept of Nash equilibria, which naturally arises when all players
minimize their corresponding cost functions simultaneously. However, there exists a broad
class of dynamic games for which the Stackelberg and the Nash solutions coincide.⁷


6 Utility is a term used especially in static game theory to denote a reward as in IRL methods.
7 These concepts will be further explained later in Section 3.5.


A literature search reveals that the problem of Definition 2.2 is largely unexplored, as mostly
special cases can be found. In the automatic control community, an early work by Fujii and
Khargonekar gives an approach to calculate solutions of an inverse linear-quadratic
differential game [FK88] with a frequency-domain formulation. The results are similar to the
one-player results developed by Kalman in [Kal64]. An inverse two-player zero-sum game has
been considered in [TMP16], where an approach which exploits necessary conditions for saddle
point solutions was presented.⁸ In [Wan07], necessary and sufficient conditions for
identification in linear-quadratic dynamic games are given. However, these are restricted to
the case of a second-order dynamic system and a two-player case. For N-player inverse dynamic
games with open-loop strategies, recent results were presented in [MFP17a, MFP17b], where
Pontryagin’s minimum principle is leveraged. In [MFP17b], a bilevel method analogous to
the ones described in Section 2.1.1 was formulated. This is portrayed in Figure 2.4: the upper
level, where the N cost functions (denoted by J_{1:N}) are updated, and the lower level, where
a dynamic game is solved to determine trajectories corresponding to the N current cost
function candidates.


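[Figure 2.4 (diagram). Upper level: compare Nash equilibrium and observed trajectories and update the cost function candidates $J_{1:N}$. Lower level: determine Nash equilibrium trajectories for the current candidates $J_{1:N}$.]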


Figure 2.4: Direct bilevel approach for inverse dynamic games: The upper level updates the cost function candidates 
such that an error measure is minimized. The lower level solves a dynamic game to determine Nash 
equilibrium trajectories. 
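
The two-level structure of Figure 2.4 can be illustrated with a minimal Python sketch. It is not the method of [MFP17b]: for brevity, the lower level solves a hypothetical static two-player game with quadratic costs in closed form instead of a dynamic game, and the upper level is a plain grid search. The cost functions, the parameters `theta1` and `theta2` and all function names are assumptions made for this illustration only.

```python
from itertools import product


def solve_forward_game(theta1, theta2):
    """Lower level: Nash equilibrium of a toy static two-player game.

    Hypothetical costs (not taken from the text):
        J1 = theta1 * u1**2 + (u1 + u2 - 1)**2
        J2 = theta2 * u2**2 + (u1 + u2 - 1)**2
    Setting dJ1/du1 = 0 and dJ2/du2 = 0 yields the linear system
        (theta1 + 1) * u1 + u2 = 1,   u1 + (theta2 + 1) * u2 = 1,
    which is solved here by Cramer's rule.
    """
    det = (theta1 + 1.0) * (theta2 + 1.0) - 1.0
    return theta2 / det, theta1 / det


def bilevel_identification(u_observed, grid):
    """Upper level: search for the cost parameters that best explain the data."""
    best_theta, best_error, n_solves = None, float("inf"), 0
    for theta in product(grid, repeat=2):
        u1, u2 = solve_forward_game(*theta)   # one forward game per candidate
        n_solves += 1
        error = (u1 - u_observed[0]) ** 2 + (u2 - u_observed[1]) ** 2
        if error < best_error:
            best_theta, best_error = theta, error
    return best_theta, best_error, n_solves


grid = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
u_observed = solve_forward_game(1.0, 2.0)     # "measured" equilibrium
theta_hat, err, n_solves = bilevel_identification(u_observed, grid)
```

Even in this toy setting, every candidate evaluated by the upper level costs one forward solution. With a dynamic game in the lower level, each of these solutions is itself a set of N coupled dynamic optimization problems.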


Dynamic game theory has been of considerable interest in economics, leading to some pro- 
posed methods for the solution of the inverse problem in this field. For example, [BBL07] 
presented an approach which is based on the estimation of the value of the cost function by 
means of a Monte Carlo method. The work of Arcidiacono et al. [ABBE16] offers a more
efficient method based on least-squares estimation and likelihood functions. Both aforementioned
methods have the main drawback that the game is limited to discrete-valued strategies
and a finite number of possible states. A dynamic game with a linear-quadratic setting was
considered in [CFG89], yet the players' cost function matrices were restricted to penalize only
their own controls and to have only diagonal entries.


As for IRL, some methods which aim at extending these techniques to the multi-agent
setting were proposed for cases in which all players behave cooperatively [HMRAD16,
NKJ+10, SKZK17]. On the other hand, IRL-based methods in a non-cooperative setting have
been proposed in [LBC18, RGZH12]. However, similar to single-agent IRL, all of these methods
are based on an MDP and hence are limited to discrete-valued and finite control and state

8 Zero-sum games represent the case where one player strives to minimize a cost function while the second player
seeks to maximize the same cost function.
spaces. Few works in the literature consider continuous-valued action and
state spaces. Two exceptions are [PSS+16], where a cooperative scenario was considered, and
[MHLK17], where each agent has an individual cost function, although the approach is not
explicitly related to game-theoretical concepts.


2.3 Discussion 


As motivated in Chapter 1, the Nash equilibrium is a promising descriptive concept for the 
interaction between biological systems and hence potentially adequate for state-of-the-art 
applications in human-machine interaction. Therefore, this thesis focuses on the solution 
of inverse dynamic games where the trajectories correspond to a Nash equilibrium. In the 
following, the term inverse dynamic game will refer to this problem. 


In order to solve inverse dynamic games, it may appear conceivable to apply a direct bilevel
approach analogously to the single-player case (cf. Section 2.1.1). In this case, however, the
lower-level problem would consist in determining the state trajectories and all players'
control trajectories corresponding to the dynamic game of the current iteration. Consequently,
the method implies the repeated solution of N coupled dynamic optimization problems
for each set of cost function candidates. The first evaluation conducted in [MFP17b]
presented a simple example where the inverse dynamic game involved the solution of 388
forward dynamic games. Especially for nonlinear dynamic games, solving for Nash equilibria
is in general computationally heavy, and efficient numerical techniques are not available
[HdICIR19].⁹ Therefore, applying this approach carries a considerable risk of very long
computation times.


This motivates the need for more efficient methods for inverse dynamic games which do 
not rely on the repeated solution of a dynamic game. A fast identification of player cost 
functions allows for an immediate adaptation of automatic controllers based on potential 
new information, e.g. if the cooperating human changes their behavior. Nevertheless, until
now, little effort has been spent on the development of alternative methods for the efficient
solution of general N-player inverse dynamic games. Methods which stem from IRL are 
restricted to discrete-valued and finite states and controls. In addition, IRL methods in a 
multiplayer setting which consider continuous-valued states and controls are also almost 
unexplored and their theoretical foundation has not been developed. The situation is similar 
in the field of automatic control, where only special cases have been treated. Apart from 
the very early work of [CFG89] in an economics-specific scenario, successful attempts to solve
general N-player inverse dynamic games have occurred only recently ([MFP17a, MFP17b]).
This work encourages further effort in exploring alternative techniques for inverse dynamic 
games which avoid a direct bilevel approach. 


9 A recent study in [HdICIR19] showed that a nonscalar two-player dynamic game with non-quadratic cost functions
can take from 479.11 to 12854 seconds to solve, depending on the applied method.
Finally, almost all of the mentioned approaches, especially in dynamic games, concentrate
on delivering a method which is able to estimate a cost function, but do not give further
insight into when an estimation is possible. This no less important aspect of the properties
of inverse dynamic game problems is almost unaddressed; there is little work on inverse
problems in optimal control and dynamic games following the ideas of Kalman and the first 
theoretical studies (cf. Section 2.1). In addition, the ill-posedness of inverse dynamic games 
demands further attention. To date, much uncertainty exists concerning the properties of 
inverse dynamic games as these are still considerably unexplored. 


2.4 Conclusion and Research Questions 


As discussed in the previous section, the inverse problem of optimal control, i.e. a single-player
inverse dynamic game, has been investigated from both a theoretical and a computational
point of view. However, the problem of modeling and identifying the behavior of
several players interacting with each other remains a largely unexplored field, especially in
the case of continuous-valued control and state spaces which is important for many applica- 
tions. The application of a direct bilevel approach to this problem is inappropriate given the 
potential complexity of solving for Nash equilibrium trajectories repeatedly. Therefore, the 
following questions need to be answered: 


— How to solve inverse dynamic games efficiently, in particular avoiding the solution of 
the forward problem? 


— Under which conditions can a solution be found and when is this solution unique? 


For this purpose, the necessary fundamentals concerning dynamic game theory and the forward
problem of determining Nash equilibria are introduced in Chapter 3 as a basis for the subsequent
results. Afterwards, the posed questions are addressed in Chapters 4 and 5, where
methods based on IOC (according to the classification in Section 2.1) are developed, and in
Chapter 6, where an IRL-based method is introduced as a means to solve inverse
dynamic games.


Furthermore, two questions which naturally arise after the development of techniques for 
solving inverse dynamic games are: 


— How do the results of these alternative approaches compare to the results of a direct 
bilevel approach? 


— Which main class of methods, IOC-based, IRL-based or direct bilevel, yields a greater 
potential for a real application, e.g. in the identification of cooperative systems with hu- 
mans? 
Probably because IOC and IRL methods have been studied by different research
communities, almost no systematic comparison of the performance of these different concepts
has been conducted so far.¹⁰ Therefore, in Chapter 7, all methods (IOC-based,
IRL-based and bilevel methods) are compared to each other using two different major classes
of inverse dynamic game problems, where robustness to measurement noise and to cost function
modeling errors is also examined. Lastly, a first application example is presented in
Chapter 8 to evaluate the performance of all methods with real experimental data.


10 Two notable exceptions are given by [TZ11] and [JAB13]. The first compared bilevel and IOC-like methods
in (single-player) inverse static optimization. The study demonstrated that the alternative method, which was
based on optimality conditions, yielded comparable results to the bilevel method with considerably less computational
effort. In [JAB13], a single-player inverse optimal control method based on Hamilton differential
equations was compared in simulations with the bilevel method [MTL10] and the continuous-time counterparts
of the methods presented in [AN04] and [RBZ06]. Their proposed method was shown to perform faster and
with less trajectory and parameter error. Nevertheless, all simulated observed trajectories were noise-free.


3 Fundamentals of Dynamic Game Theory 


This chapter gives an overview of fundamentals of dynamic game theory. After a short intro- 
duction to the general theory of games, non-cooperative dynamic and differential games are 
introduced. Furthermore, existing solution concepts for the forward problem are introduced 
and the available means for their calculation are shown. These principles provide a basis 
for the development of the inverse dynamic game methods proposed in subsequent chapters. 
The contents of this chapter are based on the books [BO99, Eng05, HKZ12, Tad13]. 


Game theory can be defined as the theory of mathematical models of decision making to 
describe situations with conflicts and cooperation between rational players. The conflicts
arise from different interests or goals, leading to a strong interdependency of the players' individual
decisions. The theory emerged from the work of von Neumann [VNM47] and blossomed
with the introduction of game equilibria by Nash [Nas51]. Since then, it has been extensively 
studied such that analytical tools are available for understanding phenomena arising from 
the interaction between decision makers. 


3.1 Introduction to Games 


One of the most frequent ways of defining a game is as a normal-form game, described in the 
following definition. 


Definition 3.1 (Game in Normal Form) 


A normal-form game is defined by
• A set of players $P = \{1, 2, \ldots, N\}$.
• A strategy set $U_i$ for each player $i \in P$.
• A set of cost functions $\mathcal{J} = \{J_1, J_2, \ldots, J_N\}$.


A game involves N decision makers called players which select particular actions from a pos- 
sible strategy set. These are chosen such that a specific goal, represented by their individual 
cost function, is accomplished. Definition 3.1 is very general and allows numerous kinds of 
games which arise from different properties of the possible actions, strategy sets and cost
functions of the players. 


If the players act in a self-interested way, i.e. they strive for minimizing their own cost func- 
tion, regardless of possible negative effects for other players, then the game is called non- 
cooperative. If the players are able to generate binding agreements and act jointly in order 
to obtain a fair result, then the game is regarded as cooperative. If the choice of actions is 
deterministic, the strategies are called pure strategies. The converse is denoted as stochastic 
or mixed strategies. Moreover, games may be finite or infinite, depending on the strategy set
$U_i$ of each player. If the set of possible strategies $U_i$ has a finite number of elements for
all players, the game is said to be finite. Otherwise, if $U_i$ is infinite for at least one player,
i.e. an infinite number of possible strategies is available for at least one player, the game is
infinite. 


An important classification of games is based on the number of times a player can choose an 
action. If the players act only once and independently of each other, the game is static. As 
soon as one player is allowed to act in several time stages based on new information resulting 
from other players’ previous actions, then the game is dynamic. Therefore, in dynamic games, 
time plays an important role. The evolution of an infinite dynamic game is naturally described 
with a difference equation in a discrete-time formulation based on the stages or discrete time 
steps in which players take action. However, a continuous-time formulation is possible as 
well, which is also known in literature as a differential game. 


The results of this thesis are based on non-cooperative infinite dynamic games in both 
discrete and continuous time. Since many results are analogous and comparable, the main 
aspects of infinite dynamic games will be shown and formalized in this chapter with a for- 
mulation in continuous time. Analogous definitions for the discrete-time case can be found 
in Appendix A. 


3.2 Differential Games 


The evolution of a differential game depends on the strategies of all players. It can be de- 
scribed by means of the time-dependent state trajectories of a dynamic system defined by 
differential equations. 
Definition 3.2 (Dynamic System in State Space Representation)

A dynamic system is defined by ordinary differential equations and an initial condition
given by

$$\dot{x}(t) = f(x(t), u_1(t), \ldots, u_N(t), t) \tag{3.1a}$$
$$x(0) = x_0, \tag{3.1b}$$

where $x(t) \in \mathbb{R}^n$ and $u_i(t) \in \mathbb{R}^{m_i}$, $i \in P$, denote the system state vector and the control
vector of player $i$ at time $t$, respectively. Furthermore, $f : \mathbb{R}^n \times \mathbb{R}^{m_1} \times \ldots \times \mathbb{R}^{m_N} \times \mathbb{R}_{\geq 0} \to \mathbb{R}^n$
is a vector function which is continuous in $t \in [0, T]$ and globally Lipschitz in $x$,
$u_1, \ldots, u_N$.


The evolution of the differential game is regarded for a time interval $[0, T]$ which represents
the duration of the game. The vector $x_0$ represents the initial state of the system. The final
time $T$ could be $T \to \infty$ or a fixed value depending on the given problem. Lipschitz continuity
of $f$ is required to ensure that the initial value problem (3.1) admits a unique solution
for every N-tuple $(u_1(t), \ldots, u_N(t))$ of continuous controls $u_i(t)$, $i \in P$. Each player $i \in P$
acts upon the system in Definition 3.2 by applying a corresponding input or control trajectory
$u_i(t)$, $\forall t \in [0, T]$, which belongs to an action space $U_i$. Each player's control decision
or strategy, denoted by $\gamma_i$, is based on the state information available to them, which is
represented by a set-valued function $\eta_i(t)$.¹¹ The strategy is chosen from a set of available
strategies $\Gamma_i$ and defines a particular control trajectory $u_i(t)$,¹² i.e.

$$u_i(t) = \gamma_i(\eta_i(t), t), \quad \gamma_i \in \Gamma_i. \tag{3.2}$$


The strategy and, consequently, the control trajectories are determined according to an individual
cost function

$$J_i = h_i(x(T), T) + \int_0^T g_i(x(t), u_1(t), \ldots, u_N(t), t) \, \mathrm{d}t, \tag{3.3}$$

where $h_i$ denotes costs which arise from the final state or final time and $g_i$ represents running
costs which arise for $t \in [0, T]$. The aim of each player $i$ is to minimize the cost function (3.3)
by applying appropriate controls $u_i(t)$. This objective is described by the dynamic optimization
problem


11 Different possibilities of player state information and corresponding strategies will be examined later in Sections
3.3 and 3.4.

12 In the context of dynamic games, actions and strategies are different and are related by (3.2). On the contrary,
in static games these are identical and the terms are therefore not distinguished.
$$\begin{aligned} \min_{u_i(t)} \quad & J_i(x(t), u_i(t), u_{\neg i}(t), t) \\ \text{w.r.t.} \quad & \dot{x}(t) = f(x(t), u_i(t), u_{\neg i}(t), t) \\ & x(0) = x_0 \end{aligned} \tag{3.4}$$

where $\neg i$ is used as a shorthand notation for "all except i". Therefore, $u_{\neg i}(t)$ denotes the input
trajectories of all players except player $i$.¹³ As a result, differential games can be described as
N coupled dynamic optimization problems.
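
The coupling between the players can be made concrete with a short Python sketch which integrates (3.1) for given control trajectories and evaluates the cost functions (3.3) of all players along the resulting state trajectory. The forward Euler scheme, the rectangle rule and the scalar two-player example are assumptions chosen for brevity; they are not prescribed by the definitions above, and all names are hypothetical.

```python
import math


def simulate_and_evaluate(f, controls, running_costs, terminal_costs, x0, T, steps):
    """Forward Euler integration of (3.1) and rectangle-rule quadrature of (3.3)."""
    dt = T / steps
    x, t = x0, 0.0
    costs = [0.0 for _ in controls]
    for _ in range(steps):
        u = [u_i(t) for u_i in controls]        # controls of all players at time t
        for i, g_i in enumerate(running_costs):
            costs[i] += g_i(x, u, t) * dt       # running costs g_i
        x = x + f(x, u, t) * dt                 # shared state driven by all players
        t += dt
    for i, h_i in enumerate(terminal_costs):
        costs[i] += h_i(x, T)                   # terminal costs h_i
    return x, costs


# Hypothetical scalar example with two players and constant controls.
f = lambda x, u, t: -x + u[0] + u[1]
controls = [lambda t: 0.5, lambda t: -0.5]
running_costs = [lambda x, u, t: x ** 2 + 1.0 * u[0] ** 2,
                 lambda x, u, t: x ** 2 + 2.0 * u[1] ** 2]
terminal_costs = [lambda x, T: 0.0, lambda x, T: 0.0]

x_T, J = simulate_and_evaluate(f, controls, running_costs, terminal_costs,
                               x0=1.0, T=1.0, steps=10000)

# Analytic reference for this example: the controls cancel in the dynamics,
# hence x(t) = exp(-t) and the state-dependent part of each cost is
state_cost = 0.5 * (1.0 - math.exp(-2.0))
```

Both cost values depend on the same state trajectory, which in turn depends on the controls of both players. This is why the N optimization problems (3.4) cannot be solved independently of each other.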


To summarize, a definition of differential games which will be used throughout this thesis is 
given. 


Definition 3.3 (Differential Game) 
A differential game is defined by 


• A set of players $P = \{1, 2, \ldots, N\}$,
• A specified time interval $[0, T]$ denoting the duration of the game,
• An infinite action set $U_i$, $\forall i \in P$,
• A set-valued function $\eta_i(t)$, $\forall i \in P$, which determines the state information of player
$i$ at time $t$,
• A dynamic system given by Definition 3.2,
• A set of cost functions $\mathcal{J} = \{J_1, J_2, \ldots, J_N\}$.


3.3 Information Structures 


A relevant characteristic of a differential game is the information available to each player at
each time $t$. The information set is described by

$$\eta_i(t) \in \mathcal{P}_{\neg\emptyset}(\{x_0, x(s), x(t)\}), \quad s \in [0, \kappa_{i,t}], \quad \kappa_{i,t} \in [0, t], \tag{3.5}$$


13 The importance of the uniqueness of the solution of (3.1) for every N-tuple $(u_1, \ldots, u_N)$ becomes clear at
this point. Non-uniqueness is clearly not allowed in a differential game since it would potentially lead to non-uniqueness
in the value of the cost functions for a single N-tuple of control trajectories.
where $\mathcal{P}_{\neg\emptyset}(\cdot)$ denotes a power set which excludes the empty set and $\kappa_{i,t}$ is non-decreasing
in $t$. At a particular time $t \in [0, T]$, player $i$ has knowledge of current or past values of the
state $x$. By (3.5), it is possible to describe a variety of information structures which are very
common in dynamic game theory. Sometimes partial state information is assumed instead
of the complete state information implied by (3.5) and considered in this thesis. The next
definition lists the concrete information structures which shall be focused on in the following.


Definition 3.4 (Information Structure of the Players) 


The information structure of player $i$ is said to be
(i) an open-loop (OL) pattern if $\eta_i(t) = \{x_0\}$, $t \in [0, T]$.
(ii) a memoryless perfect state (MPS) pattern if $\eta_i(t) = \{x_0, x(t)\}$, $t \in [0, T]$.
(iii) a feedback (FB) pattern if $\eta_i(t) = \{x(t)\}$, $t \in [0, T]$.


The open-loop information pattern describes the situation where all players decide at $t = 0$
the control trajectories $u_i(t)$ to be applied for $t \in [0, T]$ based solely on the initial system
state value $x_0$. The control decision remains unchanged for the whole duration of the game,
regardless of any possible disturbance on the states. Figure 3.1 shows a graphical represen- 
tation of a differential game with an open-loop information structure for each player. 


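[Figure 3.1 (block diagram): the initial state $x_0$ is available to Player 1, ..., Player N, who apply the strategies $\gamma_1(x_0, t), \ldots, \gamma_N(x_0, t)$.]

Figure 3.1: Differential game with an open-loop information structure.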


In case of a memoryless perfect state pattern, the players have information of the initial
state $x_0$ and the current state $x(t)$. The inclusion of the initial state becomes necessary for
solving differential games where some of the players have an OL information pattern and
others have access to the states $x(t)$. In this thesis, the converse case (equal information
patterns for all players) is considered such that a feedback information pattern can be used
equivalently.¹⁴ These last two information structures imply "closing the loop" in a control-theoretical
sense. The resulting multiplayer control loop for a feedback information structure
is depicted as an example in Figure 3.2.


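[Figure 3.2 (block diagram): the current state $x(t)$ is fed back to Player 1, ..., Player N, who apply the strategies $\gamma_1(x(t), t), \ldots, \gamma_N(x(t), t)$.]

Figure 3.2: Differential game with a feedback information structure.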


The different information patterns lead to various kinds of strategies selected by the players, 
each of which leads to a particular solution of the differential game, i.e. resulting state and 
control trajectories. 


3.4 Strategies 


As mentioned previously, the strategy defines the controls of the players based on the information
available to them. Therefore, for each information structure defined above, we obtain
a different class of strategies. The next definitions specify the strategy classes corresponding
to the open-loop and the feedback information patterns.


14 The later defined Nash equilibrium solution (cf. Section 3.5.1) is identical under both MPS and FB information
patterns since the equilibrium dependence on $x_0$ is given only for the initial time $t = 0$. Therefore, these information
patterns can be considered as equivalent in this sense [BO99, p. 278]. For this reason, in the following
only the OL and the FB information patterns shall be considered.
Definition 3.5 (Open-Loop Strategy)

An open-loop strategy $\gamma_i$ for player $i \in P$ selects a control action according to

$$u_i(t) = \gamma_i(x_0, t), \quad \forall x_0 \in \mathbb{R}^n, \ \forall t \in [0, T], \tag{3.6}$$

where $\gamma_i$ is a continuous function in $t$ and defined for each possible initial state $x_0$. The set
of all such possible strategies is denoted by $\Gamma_i^{\mathrm{OL}}$.


Definition 3.6 (Feedback Strategy)
A feedback strategy $\gamma_i$ for player $i \in P$ selects a control action according to

$$u_i(t) = \gamma_i(x(t), t), \quad \forall t \in [0, T], \tag{3.7}$$

where $\gamma_i$ is continuous in $t$ and globally Lipschitz in $x$. The set of all such possible strategies
is denoted by $\Gamma_i^{\mathrm{FB}}$.


An open-loop strategy describes the situation where all players decide at $t = 0$ the control
trajectories $u_i(t)$ to be applied for $t \in [0, T]$ based solely on the initial state value $x_0$ of the
dynamic system. The control decision remains unchanged for the whole duration of the
game, regardless of any possible disturbance on the states. The feedback strategy implies
that the players define their actions based on the current state $x(t)$. Therefore, each player
commits to a particular reaction to the information concerning the state of the system.
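
The practical difference between the two strategy classes can be seen in a small simulation. The following Python sketch is an illustration under assumptions that do not stem from the text: a hypothetical scalar system $\dot{x} = u_1 + u_2$, linear feedback strategies with arbitrarily chosen gains (not equilibrium strategies of any particular game), and open-loop strategies constructed such that both classes generate the same controls in the undisturbed case. A state jump in the middle of the time interval then plays the role of a disturbance.

```python
import math


def simulate(strategies, x0, T, steps, jump_step=None, jump=0.0):
    """Euler simulation of the hypothetical system dx/dt = u_1 + ... + u_N.

    Every strategy is called as gamma(x0, x, t); an open-loop strategy only
    uses (x0, t), a feedback strategy only uses (x, t).
    """
    dt = T / steps
    x = x0
    for k in range(steps):
        if k == jump_step:
            x += jump                      # unmodeled disturbance on the state
        t = k * dt
        x += sum(gamma(x0, x, t) for gamma in strategies) * dt
    return x


gains = (1.0, 1.0)                         # arbitrary illustrative gains
K = sum(gains)
feedback = [lambda x0, x, t, k=k: -k * x for k in gains]
# Open-loop counterpart: the feedback law evaluated along the undisturbed
# trajectory x(t) = x0 * exp(-K * t).
open_loop = [lambda x0, x, t, k=k: -k * x0 * math.exp(-K * t) for k in gains]

nominal_fb = simulate(feedback, 1.0, 2.0, 2000)
nominal_ol = simulate(open_loop, 1.0, 2.0, 2000)
disturbed_fb = simulate(feedback, 1.0, 2.0, 2000, jump_step=1000, jump=1.0)
disturbed_ol = simulate(open_loop, 1.0, 2.0, 2000, jump_step=1000, jump=1.0)
```

Without the disturbance, both strategy classes lead to nearly the same final state. With the disturbance, the feedback strategies react to the new state, whereas the open-loop controls remain unchanged and the state offset persists until the end of the game.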


These strategy types are the basis for the solution of differential games. In the following, 
different solution concepts are presented. 


3.5 Solution Concepts in Differential Games 


A differential game may have different outcomes depending on its properties. The main 
difference arises from the cooperative or non-cooperative nature of the interacting players. In 
a non-cooperative game, all players act strictly rationally in order to minimize their own cost 
function, regardless of the detriment this may cause to other players. In this kind of game, the 
most common solution concepts are described as game-theoretical equilibria. These are the 
so-called Nash equilibrium [Nas51] and the Stackelberg equilibrium [Sta52]. In turn, in
cooperative differential games, players are able to cooperate and make agreements such that
they can (potentially better) achieve their objectives. In this kind of game, Pareto efficient
solutions [Par14] are mostly sought.
3.5.1 Non-Cooperative Games
Nash Equilibrium 


The Nash equilibrium is a solution concept in game theory which arises if (i) all players act
simultaneously and optimally with respect to their own cost function and their beliefs of
the other players' strategies and (ii) these beliefs are correct.¹⁵ An alternative, equivalent
definition is the following: for each player, there is no feasible input strategy other than the
current, optimal one that would lower his own costs, given that all the other
players apply their optimal input strategies [Nas51]. In other words, it is not possible for any
player to obtain a lower value of his cost function by solely altering his individual strategy.
A formal definition is given in the following:


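Definition 3.7 (Nash Equilibrium)
A Nash equilibrium is described by the N-tuple of strategies $\gamma^* := (\gamma_1^*, \ldots, \gamma_N^*)$, with $\gamma_i^* \in
\Gamma_i^\sigma$, $i \in P$, $\sigma \in \{\mathrm{OL}, \mathrm{FB}\}$, which satisfies

$$J_i(\gamma_i^*, \gamma_{\neg i}^*) \leq J_i(\gamma_i, \gamma_{\neg i}^*), \quad \forall \gamma_i \in \Gamma_i^\sigma, \ \forall i \in P,$$

i.e. $\gamma_i^* = u_i^*(t)$, $t \in [0, T]$, is the optimal input strategy for each player $i$ considering the optimal
input strategies of all other players $\gamma_{\neg i}^*$. The resulting tuple of control trajectories $u^* :=
(u_1^*(t), \ldots, u_N^*(t))$ is called Nash equilibrium solution.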


Definition 3.7 describes either an open-loop Nash equilibrium (OLNE) or a feedback
Nash equilibrium (FNE), depending on the kind of strategy which is applied by each player,
i.e. whether the strategy set $\Gamma_i$ is given by $\Gamma_i^{\mathrm{OL}}$ or $\Gamma_i^{\mathrm{FB}}$, respectively. The corresponding state
trajectories $x^*(t)$ are determined by solving the initial value problem (3.1) using the control
trajectories $(u_1^*(t), \ldots, u_N^*(t))$. The OLNE has the property of being a weakly time consistent
solution. This means that the players do not have any incentive to deviate from their
strategy during the game, i.e. at any time $t_1 \in [0, T]$. On the other hand, the FNE is
strongly time consistent,¹⁶ which means that the strategy $\gamma_i^*$ is still an equilibrium strategy
if it is applied from any time $t_1 \in [0, T]$ and starting from any arbitrarily chosen state $x(t_1)$
off the original equilibrium path (which is reachable from $x(0)$). This makes the feedback
Nash equilibrium more robust towards any possible disturbances on the system state.


A differential game may have no Nash equilibrium, a single one or multiple ones.
Furthermore, a Nash equilibrium cannot be uniquely associated with
a set of cost functions $\mathcal{J}$. This fact is of particular importance for the inverse differential
game problem and will be discussed in Section 4.1 of the next chapter.
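
Both the Nash inequality and the lack of a unique association between equilibria and cost functions can be illustrated with a small Python sketch. For simplicity, it uses a finite static game (a prisoner's dilemma in cost form) instead of a differential game, which is an assumption of this example; the cost values and all names are hypothetical. The sketch enumerates all pure strategy tuples, keeps those for which no player can lower his cost by a unilateral deviation, and repeats the search after scaling each cost function by a positive constant.

```python
from itertools import product


def pure_nash_equilibria(costs, strategy_sets):
    """Return all pure strategy tuples which satisfy the Nash inequality.

    costs[i] maps a strategy tuple to the cost of player i (to be minimized).
    """
    equilibria = []
    for profile in product(*strategy_sets):
        is_equilibrium = all(
            costs[i][profile] <= costs[i][profile[:i] + (alt,) + profile[i + 1:]]
            for i, own_set in enumerate(strategy_sets)
            for alt in own_set          # every unilateral deviation of player i
        )
        if is_equilibrium:
            equilibria.append(profile)
    return equilibria


strategy_sets = [("C", "D"), ("C", "D")]
costs = [
    {("C", "C"): 1, ("C", "D"): 3, ("D", "C"): 0, ("D", "D"): 2},  # player 1
    {("C", "C"): 1, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 2},  # player 2
]
# Different cost functions: each one scaled by its own positive constant.
scaled_costs = [
    {p: 3.0 * c for p, c in costs[0].items()},
    {p: 0.5 * c for p, c in costs[1].items()},
]

equilibria = pure_nash_equilibria(costs, strategy_sets)
equilibria_scaled = pure_nash_equilibria(scaled_costs, strategy_sets)
```

Both sets of cost functions yield the same equilibrium, since multiplying a cost function by a positive constant does not change its minimizers. Positive scaling is therefore one simple reason why observed equilibrium behavior alone cannot determine the cost functions uniquely.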


15 An example of this is a situation where all cost functions are made public to all players [OR94, p. 14].
16 Also called subgame perfect, see e.g. [Eng05, Definition 8.2].
Stackelberg Solutions


Previously, it was assumed that the players select their strategies simultaneously. A scenario
where the players select their strategies one after the other can lead to a different outcome of
the game. Such a setting was first introduced by von Stackelberg in the context of a duopoly
output game [Sta52]. In a general N-player situation, one of the players is selected as a leader
who announces his selected control strategy. Afterwards, the next player uses this
information to decide on his own strategy such that his cost function is minimized.
This process continues until player N chooses his strategy based on the announcements of
the other $N - 1$ players' strategies. Stackelberg solutions are mostly considered in economic
applications, e.g. market models, and are typically defined in a 2-player setting (cf. [CC72]).


Definition 3.8 (Stackelberg Strategy) 


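The strategy tuple $\gamma^* = (\gamma_1^*, \gamma_2^*)$ is called a Stackelberg strategy with player 1 as leader and
player 2 as follower if for all $\gamma_1 \in \Gamma_1$

$$J_1(\gamma_1^*, \gamma_2^*) \leq J_1(\gamma_1, \bar{\gamma}_2(\gamma_1)), \tag{3.8}$$

where $\bar{\gamma}_2(\gamma_1) \in \Gamma_2$ denotes the optimal response of player 2 to a fixed strategy of player 1,
i.e.

$$J_2(\gamma_1, \bar{\gamma}_2(\gamma_1)) = \min_{\gamma_2 \in \Gamma_2} J_2(\gamma_1, \gamma_2), \tag{3.9}$$

and $\gamma_2^* = \bar{\gamma}_2(\gamma_1^*)$.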


The Stackelberg strategy is an attractive strategy when the information pattern is biased or
asymmetric. This means that player 2 (the follower) does not know the cost function of player 1
(the leader), but player 1 has knowledge of both cost functions, as required for evaluating (3.8).
This is the case in a market model where there is a
dominant company. The leader has the advantage of potentially obtaining better
results since he is aware that the rest of the players will act optimally based on
whatever strategy he may apply.
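
The difference between simultaneous and sequential play can be sketched numerically. The following Python example is an illustration under assumptions of its own: a static duopoly-like game with scalar strategies and hypothetical quadratic cost functions `J1` and `J2` (negative profits), a ternary search as a generic minimizer of unimodal functions, and an alternating best-response iteration for the Nash solution. The Stackelberg solution follows the definition above: the follower's optimal response is computed as in (3.9) and the leader minimizes his cost along this response as in (3.8).

```python
def argmin_unimodal(func, lo, hi, tol=1e-10):
    """Ternary search for the minimizer of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


A = 12.0  # hypothetical market parameter


def J1(u1, u2):
    return u1 * u1 + u1 * u2 - A * u1


def J2(u1, u2):
    return u2 * u2 + u1 * u2 - A * u2


def response_2(u1):
    """Optimal response of the follower to a fixed leader strategy, cf. (3.9)."""
    return argmin_unimodal(lambda u2: J2(u1, u2), 0.0, A)


def response_1(u2):
    """Optimal response of player 1, only needed for the Nash solution."""
    return argmin_unimodal(lambda u1: J1(u1, u2), 0.0, A)


def stackelberg():
    """The leader minimizes J1 along the follower's response, cf. (3.8)."""
    u1 = argmin_unimodal(lambda u: J1(u, response_2(u)), 0.0, A, tol=1e-6)
    return u1, response_2(u1)


def nash(iterations=60):
    """Alternate best responses until the strategies are mutually consistent."""
    u1, u2 = 0.0, 0.0
    for _ in range(iterations):
        u1 = response_1(u2)
        u2 = response_2(u1)
    return u1, u2


u_nash = nash()
u_stackelberg = stackelberg()
```

For this example, the first-order conditions give the Nash solution $u_1 = u_2 = A/3$ and the Stackelberg solution $u_1 = A/2$, $u_2 = A/4$. The leader attains a lower cost than in the Nash solution, while the follower's cost increases.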


First derivations of Stackelberg solutions for dynamic games were given e.g. in [CC72, Med78].
For the (continuous-time) differential game case with N players, [Rub06, Proposition 2.3]
states that the Stackelberg solution coincides with the feedback Nash equilibrium solution,
provided it exists, if and only if (i) the running costs $g_i$ depend solely on the state $x$ and
each player's own controls $u_i$, i.e.

$$g_i(x(t), u_i(t), u_{\neg i}(t)) = g_i(x(t), u_i(t)), \tag{3.10}$$

and (ii) the dynamics of the state depend, at the most, linearly on each player's controls, i.e.
the system dynamics have the control-affine form

$$\dot{x}(t) = f(x, t) + \sum_{i=1}^{N} G_i(x, t) u_i(t). \tag{3.11}$$


3.5.2 Cooperative Games 


Contrary to the non-cooperative case, a cooperative game involves players who not only
seek the optimization of their own objectives but also consider the objectives of the other
players in the selection of the control actions. Hence, it is assumed that they cooperate in
order to achieve their objectives.¹⁷ However, no side-payments take place, which means that
their cooperative behavior is not explicitly rewarded by introducing a cost-lowering term
in the objective function. Consequently, depending on how the players decide to distribute
their efforts, several possible minima exist for each particular player $i \in P$.


In the field of cooperative games, the concept of dominating strategies plays an important
role. A strategy tuple $\gamma_{(a)}$ dominates another strategy tuple $\gamma_{(b)}$ if the application of $\gamma_{(a)}$
leads to lower costs for all players compared to $\gamma_{(b)}$. Therefore, dominating strategies lead to
a better result for all players. This line of thought motivates considering only solutions that
cannot be improved for all players simultaneously and leads to the concept
of Pareto efficient solutions.


Pareto Efficient Solutions 


A Pareto efficient solution is a combination of strategies such that it is not possible to obtain 
a better result in terms of the own cost function of each player without affecting the result 
of other players negatively. This means that, while it may be possible for individual players 
to improve their own result by changing their own action unilaterally, this would lead to a 
worse result for at least one of the other players. A Pareto efficient solution is defined as 
follows [Eng05, Definition 6.1]: 


17 Nevertheless, coalitional games, where several groups of players may build coalitions to act non-cooperatively
with respect to other ones, are excluded in this thesis. See the definitions given in [ER11].
Definition 3.9 (Pareto Efficient Solution of a Differential Game)
An N-tuple of strategies $\gamma^P = (\gamma_1^P, \ldots, \gamma_N^P) \in \Gamma$ is a Pareto efficient solution (PES) of a differential
game if no other feasible tuple $\gamma = (\gamma_1, \ldots, \gamma_N)$ exists for which

$$J_j(\gamma) < J_j(\gamma^P) \tag{3.12}$$

for at least one $j \in P$ and

$$J_i(\gamma) \leq J_i(\gamma^P), \quad \forall i \in P, \ i \neq j. \tag{3.13}$$


Definition 3.9 states that a PES is a combination of strategies such that it is not possible that 
any player obtains a lower value of his cost function by deviating from the strategy without 
affecting at least one other player negatively. Therefore, Pareto optima do not represent 
a stable solution of a non-cooperative game, since in such a game each player strives for 
minimization of their own cost function. A non-cooperative player will deviate from the 
Pareto strategy if this implies a lower value of his cost function, regardless of the resulting 
drawback for other players. 
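
The conditions (3.12) and (3.13) translate directly into a dominance test. The following Python sketch applies this test to a finite set of outcomes; the restriction to a finite static game and the cost values (a prisoner's dilemma in cost form) are assumptions made for illustration, and the function names are hypothetical.

```python
def dominates(costs_a, costs_b):
    """True if a is no worse for every player and strictly better for at least
    one player, i.e. if a satisfies (3.12) and (3.13) with respect to b."""
    no_worse = all(a <= b for a, b in zip(costs_a, costs_b))
    strictly_better = any(a < b for a, b in zip(costs_a, costs_b))
    return no_worse and strictly_better


def pareto_efficient(outcomes):
    """Return the labels of all outcomes which are not dominated by another one.

    outcomes maps a label of a strategy tuple to the tuple of player costs.
    """
    return [label for label, costs in outcomes.items()
            if not any(dominates(other, costs) for other in outcomes.values())]


# Hypothetical two-player example: costs (J1, J2) for each strategy tuple.
outcomes = {"CC": (1, 1), "CD": (3, 0), "DC": (0, 3), "DD": (2, 2)}
efficient = pareto_efficient(outcomes)
```

In this example, the tuple DD is dominated by CC and is the only outcome which is not Pareto efficient. At the same time, DD is the only tuple from which no player can improve by a unilateral deviation, whereas each player can lower his own cost by deviating from CC. This mirrors the statement above that Pareto efficient solutions are in general not stable in a non-cooperative game.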


3.6 Calculation of Differential Game Solutions 


This thesis focuses on the Nash equilibrium and on Pareto efficient solutions of differential 
games. Therefore, in the following, the relevant means for calculating these solutions are 
presented. 


3.6.1 Open-Loop Nash Equilibrium 


The basis of the calculation of Nash equilibria is Definition 3.7. The inequality implies that
the optimal strategy $\gamma_i^* \in \Gamma_i^\sigma$ leads to a control trajectory $u_i^*(t)$ which minimizes the cost
function $J_i(u_i(t), u_{\neg i}^*(t))$ subject to the system dynamics

$$\dot{x}(t) = f(x(t), u_i(t), u_{\neg i}^*(t), t), \tag{3.14}$$

i.e. the system dynamics with the optimal controls of the other players $j \in P$, $j \neq i$. Therefore,
we obtain an optimal control problem for player $i$ since $u_{\neg i}^*(t)$ does not depend on $u_i(t)$.
Hence, the tools of classical optimal control can be applied. In particular, Pontryagin's minimum
principle (see e.g. [Nai03, Chapter 6]) can be used to determine a set of differential
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equations which represent necessary conditions for Nash equilibria. As in optimal control, 
the analysis of differential games is based on the Hamiltonian function 


$$H_i(\psi_i(t), x(t), u_i(t), u_{\neg i}(t), t) = g_i(x(t), u_i(t), u_{\neg i}(t), t) + \psi_i^\top(t) f(x(t), u_i(t), u_{\neg i}(t), t) \tag{3.15}$$


for all $t \in [0, T]$ and all players $i \in \mathcal{P}$, where $\psi_i : [0, T] \to \mathbb{R}^n$ are so-called costate functions
or Lagrangian multiplier functions. Given the case of an open-loop information structure 
and corresponding strategies as defined in Definition 3.5, the equilibrium is said to be an 
open-loop Nash equilibrium. The following theorem gives necessary conditions for such 
equilibria. 


Theorem 3.1 (Necessary Conditions for Open-Loop Nash Equilibria)

For an N-player differential game of fixed duration $[0,T]$, let $f(x, u_1, \ldots, u_N, t)$,
$g_i(x, u_1, \ldots, u_N, t)$ and $h_i(x(T), T)$ be continuously differentiable with respect to $x$ for all
$t \in [0,T]$, $i \in \mathcal{P}$.

Then, if $\gamma^* = (\gamma_1^*(x_0, t), \ldots, \gamma_N^*(x_0, t))$, where $\gamma_i^* \in \Gamma_i^{OL}$ and $\gamma_i^*(x_0, t) = u_i^*(t)$, $i \in \mathcal{P}$,
provides an open-loop Nash equilibrium (OLNE) solution with $x^*(t)$ as the corresponding
state trajectory, the trajectories of the N costate functions $\psi_i(t)$, $i \in \mathcal{P}$, satisfy the relations:

$$\dot{x}^*(t) = f(x^*(t), u_1^*(t), \ldots, u_N^*(t), t), \quad x^*(0) = x_0 \tag{3.16a}$$

$$u_i^*(t) = \arg\min_{u_i(t)} H_i(\psi_i(t), x^*(t), u_i(t), u_{\neg i}^*(t), t) \tag{3.16b}$$

$$\dot{\psi}_i(t) = -\nabla_x H_i(\psi_i(t), x^*(t), u_i^*(t), u_{\neg i}^*(t), t) \tag{3.16c}$$

$$\psi_i(T) = \nabla_x h_i(x^*(T), T), \tag{3.16d}$$

where $\nabla_x$ denotes the partial derivative with respect to the state variable $x$.


Proof:
See the proof of Theorem 6.11 of [BO99]. □


The set of differential equations (3.16) has to be fulfilled by all open-loop Nash equilibria
and is valid for the general case where $u_i$ is constrained. In case the optimal controls lie
strictly inside the set defining the constraints or if we have unconstrained controls $u_i \in \mathbb{R}^{m_i}$
as considered in Definition 3.2, the control equation (3.16b) leads to

$$0 = \nabla_{u_i} H_i(\psi_i(t), x^*(t), u_i^*(t), u_{\neg i}^*(t), t), \tag{3.17}$$

where $\nabla_{u_i}$ denotes the partial derivative with respect to $u_i$. Therefore, with the application of
Theorem 3.1 we obtain a set of coupled differential equations. Under some further assumptions
including the cost functions being decoupled with respect to each player's controls,
i.e. (3.10) holds, and the system dynamics having the form (3.11), it is possible to formulate
a two-point boundary value problem (TPBVP), generally consisting of $(N + 1)n$ ODEs and
$(N + 1)n$ boundary conditions which can potentially be solved using numerical methods, e.g.
shooting techniques [AMR95, Chapter 4]. Further details are given in Section B.3 of the Appendix.
Note that the minimum principle of Pontryagin and therefore Theorem 3.1 represent
only necessary conditions for Nash equilibria. They generate candidates for OLNE solutions
but there is no guarantee that these candidates are indeed Nash equilibria. However, under further
assumptions, the minimum principle becomes a sufficient condition for optimality. Following
[Doc00, Theorem 3.2], it can be stated that if $H_i(\psi_i(t), x(t), u_i(t), u_{\neg i}(t), t)$
is convex in $x$ and also continuously differentiable in $x$, and furthermore $h_i$ is convex, then
the controls $u_i^*(t)$ are optimal with respect to each corresponding optimization problem and
hence describe an OLNE.


In the following, an example is given to illustrate the procedure of calculating an OLNE by 
means of Theorem 3.1. 


Example 3.1: 


We consider a scenario consisting of two players controlling a system given by

$$\dot{x}(t) = -x(t) + u_1(t) + u_2(t). \tag{3.18}$$


Each player acts based on the cost function 


$$J_i = \int_0^\infty \left( \frac{1}{2}x^2(t) + \frac{1}{2}u_i^2(t) \right) dt, \quad i \in \{1,2\}. \tag{3.19}$$


In the following, i and j are used to denote any player from the set $\mathcal{P} = \{1,2\}$ such that $i \neq j$.
Furthermore, time dependencies are omitted for brevity. 


To determine the OLNE, we first determine the Hamiltonian of each player: 


$$H_i = \frac{1}{2}x^2 + \frac{1}{2}u_i^2 + \psi_i(-x + u_i + u_j), \quad i,j \in \{1,2\},\ i \neq j. \tag{3.20}$$

We now can utilize the necessary conditions for open-loop Nash equilibria given by Theo- 
rem 3.1. The control equation (3.16b) leads to 


$$\frac{\partial H_i}{\partial u_i} = u_i + \psi_i = 0 \iff u_i = -\psi_i. \tag{3.21}$$

From (3.16c) we obtain the differential equation 


$$\dot{\psi}_i = -\frac{\partial H_i}{\partial x} = -x + \psi_i. \tag{3.22}$$

Furthermore, the system dynamics equation (3.16a) given by

$$\dot{x} = -x + u_1 + u_2 \tag{3.23}$$


must hold as well. 


By combining (3.21), (3.22) and (3.23) we obtain the linear system of differential equations 


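$$\begin{bmatrix} \dot{x} \\ \dot{\psi}_1 \\ \dot{\psi}_2 \end{bmatrix} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ \psi_1 \\ \psi_2 \end{bmatrix}. \tag{3.24}$$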


Given that optimal control and differential game problems usually specify initial conditions
for the state vector and terminal conditions for the costates $\psi_i$, this system of differential
equations represents a TPBVP. In this case, it can be solved both analytically and numerically.
The general analytical solution can be determined e.g. by the eigenvalue and eigenvector 
method (see e.g. [HS14, Section 5.3]) and results in 


$$x^*(t) = C_1(\sqrt{3} + 1)\exp(-\sqrt{3}t) + C_2(1 - \sqrt{3})\exp(\sqrt{3}t), \tag{3.25}$$

$$\psi_1(t) = C_1\exp(-\sqrt{3}t) + C_2\exp(\sqrt{3}t) - C_3\exp(t), \tag{3.26}$$

$$\psi_2(t) = C_1\exp(-\sqrt{3}t) + C_2\exp(\sqrt{3}t) + C_3\exp(t), \tag{3.27}$$


where the constants $C_l$, $l \in \{1, \ldots, 3\}$, are determined by using the aforementioned boundary
conditions for states and costates. The OLNE solution results directly from the costate
functions (3.26) and (3.27). Finally, we recognize that in this example the conditions of
Theorem 3.1 are both necessary and sufficient.
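A general solution of this kind can be checked by substituting it back into the linear system (3.24). The sketch below does this with hand-differentiated expressions; the state coefficients $(\sqrt{3}+1)$ and $(1-\sqrt{3})$ are the ones implied by the eigenvectors of the matrix in (3.24), and the constants are arbitrary hypothetical values rather than the ones fixed by boundary conditions.

```python
import math

SQRT3 = math.sqrt(3.0)


def solution(t, c1, c2, c3):
    """Candidate general solution (x, psi_1, psi_2) of the linear system (3.24)."""
    decay, growth, unit = math.exp(-SQRT3 * t), math.exp(SQRT3 * t), math.exp(t)
    x = c1 * (SQRT3 + 1.0) * decay + c2 * (1.0 - SQRT3) * growth
    psi1 = c1 * decay + c2 * growth - c3 * unit
    psi2 = c1 * decay + c2 * growth + c3 * unit
    return x, psi1, psi2


def derivative(t, c1, c2, c3):
    """Time derivative of the candidate solution, differentiated term by term."""
    decay, growth, unit = math.exp(-SQRT3 * t), math.exp(SQRT3 * t), math.exp(t)
    dx = -SQRT3 * c1 * (SQRT3 + 1.0) * decay + SQRT3 * c2 * (1.0 - SQRT3) * growth
    dpsi1 = -SQRT3 * c1 * decay + SQRT3 * c2 * growth - c3 * unit
    dpsi2 = -SQRT3 * c1 * decay + SQRT3 * c2 * growth + c3 * unit
    return dx, dpsi1, dpsi2


def residuals(t, c1, c2, c3):
    """Left-hand side minus right-hand side of the three rows of (3.24)."""
    x, psi1, psi2 = solution(t, c1, c2, c3)
    dx, dpsi1, dpsi2 = derivative(t, c1, c2, c3)
    return (dx - (-x - psi1 - psi2), dpsi1 - (-x + psi1), dpsi2 - (-x + psi2))


constants = (0.7, -0.2, 0.4)  # arbitrary, hypothetical values of C_1, C_2, C_3
max_residual = max(abs(r) for t in (0.0, 0.3, 1.0) for r in residuals(t, *constants))
print(max_residual)
```

The residual vanishes up to rounding for any choice of the three constants, which is what makes the expression a general solution; the boundary conditions then select one member of this family.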


3.6.2 Feedback Nash Equilibrium 


Consider a differential game where the players apply a feedback strategy as in Definition 
3.6. By applying the minimum principle, similar equations to the ones of Theorem 3.1 result. 
Nevertheless, instead of (3.16c), the equation 


$$\dot{\psi}_i(t) = -\nabla_x H_i(\psi_i(t), x^*(t), u_i^*(t), \gamma_{\neg i}^*(x^*, t), t) \tag{3.28}$$


holds. The time dependency of the state in the strategies $\gamma_{\neg i}^*$ is dropped here and in the
following for brevity. In this new costate equation, the controls $u_{\neg i}^*(t) = \gamma_{\neg i}^*(x^*, t)$ have an
influence on the partial derivative in (3.16c) since, contrary to the open-loop case, they now
depend on the current value of $x(t)$. Even though these new equations define a closed-loop
no-memory Nash equilibrium, they are not computationally convenient [SH69a]. Furthermore,
there is in general an uncountable number of solutions to the resulting differential
equations, one of which is the open-loop solution determined in (3.16) [BO99, p. 277].
In order to eliminate this so-called "informational non-uniqueness", the concept of feedback
Nash equilibria is introduced. This refinement states that if an N-tuple of strategies
$\gamma^* = (\gamma_1^*, \ldots, \gamma_N^*)$ constitutes an FNE solution of a differential game with duration $[0, T]$, then its
restriction to the time interval $[t, T]$, for any $t \in [0,T]$, describes an FNE solution for the
same differential game defined on this shorter time interval $[t, T]$. A consequence of this
requirement is the strong time consistency of FNE solutions (cf. Section 3.5.1). Furthermore,
any FNE also fulfills the equations of Theorem 3.1 with the costate equation (3.28).


The core of the results concerning feedback Nash equilibria is given by N coupled Hamilton- 
Jacobi-Bellman (HJB) equations for which the value function, known from optimal control, 
is extended to the N-player case. 


Definition 3.10 (Value Function) 


Consider a player $i \in \mathcal{P}$. Let the optimal strategies of the other players $\gamma_{\neg i}^*$ associated with an
N-player non-cooperative differential game be given. The value function $V_i : \mathbb{R}^n \times [0,T] \to \mathbb{R}$
of player i is defined by

$$V_i(x, t) = \min_{\{\gamma_i(x,s),\ t \le s \le T\}} \int_t^T g_i(x_i(s), \gamma_i(x, s), \gamma_{\neg i}^*(x, s), s)\, ds + h_i(x(T), T) \tag{3.29}$$

$$V_i(x, t) = \int_t^T g_i(x_i^*(s), \gamma_i^*(x, s), \gamma_{\neg i}^*(x, s), s)\, ds + h_i(x^*(T), T), \tag{3.30}$$

satisfying the boundary condition

$$V_i(x, T) = h_i(x, T), \tag{3.31}$$

and where

$$\dot{x}_i(s) = f(x_i(s), \gamma_i(x, s), \gamma_{\neg i}^*(x, s)), \quad x(t) = x. \tag{3.32}$$


The value function $V_i$, $i \in \mathcal{P}$, represents the minimum cost-to-go from any initial state $x$ and
any initial time $t$ which is attainable by player i, where the optimal strategies of the other
$N - 1$ players are fixed. With this definition, the following theorem can be stated.
Theorem 3.2 (Sufficient Conditions for Feedback Nash Equilibria)


For an N-player differential game of prescribed fixed duration $[0,T]$, an N-tuple of feedback
strategies $\gamma^* = (\gamma_1^*, \ldots, \gamma_N^*)$, where $\gamma_i^* \in \Gamma_i^{FB}$ and $\gamma_i^*(x,t) = u_i^*(t)$, $i \in \mathcal{P}$, provides
a feedback Nash equilibrium (FNE) solution if there exist continuously differentiable value
functions $V_i$ according to Definition 3.10 which satisfy the partial differential equations

$$\begin{aligned}
-\frac{\partial V_i(x,t)}{\partial t} &= \min_{u_i} \left[ \nabla_x V_i(x,t)^\top \tilde{f}_i(x, u_i, t) + \tilde{g}_i(x, u_i, t) \right] \\
&= \nabla_x V_i(x,t)^\top \tilde{f}_i(x, \gamma_i^*(x,t), t) + \tilde{g}_i(x, \gamma_i^*(x,t), t), \\
V_i(x,T) &= h_i(x, T), \quad i \in \mathcal{P},
\end{aligned} \tag{3.33}$$

where

$$\begin{aligned}
\tilde{f}_i(x(t), u_i(t), t) &= f(x(t), \gamma_{\neg i}^*(x, t), u_i(t), t), \\
\tilde{g}_i(x(t), u_i(t), t) &= g_i(x(t), \gamma_{\neg i}^*(x, t), u_i(t), t).
\end{aligned} \tag{3.34}$$

The corresponding Nash equilibrium cost for player i is $V_i(x_0, 0)$.

Proof:
See the proof of Theorem 6.16 of [BO99]. □


The following example illustrates the use of Theorem 3.2 to determine a FNE solution of a 
differential game. 


Example 3.2: 


Consider the differential game with two players from Example 3.1, where they control a system
with dynamics (3.18) and each of them chooses his actions such that his individual cost
function (3.19) is minimized. However, contrary to the previous example, each of the players applies
a feedback strategy according to Definition 3.6. Again, function dependencies are omitted
for brevity, unless a variable dependence demands special attention.


Given time-independent functions $g_i(x, u_i, u_{\neg i})$ and system dynamics as well as the infinite
horizon ($T \to \infty$), the value function also does not depend explicitly on time (cf. [HKZ12,
Remark 7.5]), and therefore the HJB equation of each player results in

$$0 = \min_{u_i} \left[ \frac{1}{2}x^2 + \frac{1}{2}u_i^2 + \frac{\partial V_i}{\partial x}(-x + u_i + u_j) \right], \quad i,j \in \{1,2\},\ i \neq j. \tag{3.35}$$


Minimizing the expression on the right-hand side leads to

$$u_i + \frac{\partial V_i}{\partial x} = 0 \iff u_i^* = -\frac{\partial V_i}{\partial x} =: \gamma_i^*(x). \tag{3.36}$$
At this point, it is usually necessary to guess the structure of the value function $V_i$. Given the
linear system dynamics and the quadratic cost function, we hypothesize a quadratic value
function. Moreover, given the symmetric structure¹⁸ of the game, we are interested in symmetric
equilibrium actions $u_i = u_j$ leading to identical value functions.


For any player $i \in \{1,2\}$, we write the value function as

$$V_i(x) = \frac{A}{2}x^2 + Bx + C \;\Rightarrow\; \frac{\partial V_i}{\partial x} = Ax + B \tag{3.37}$$

with $A, B, C \in \mathbb{R}$. By using (3.36) and (3.37), the HJB equation (3.35) leads after some
simplification to

$$0 = \left( -\frac{3}{2}A^2 - A + \frac{1}{2} \right) x^2 - (3AB + B)\,x - \frac{3}{2}B^2. \tag{3.38}$$

By comparing the coefficients on both sides of the equation we obtain $B = 0$ and two possible
values $A_1 = -1$, $A_2 = \frac{1}{3}$. Given the positive integrand in (3.19), the value function must be
positive and therefore, $A_1$ is discarded. With (3.36) and (3.37) we obtain the optimal feedback strategy

$$\gamma_i^*(x) = -\frac{1}{3}x(t) \tag{3.39}$$

and the corresponding state trajectory

$$x^*(t) = C\exp\left(-\frac{5}{3}t\right), \tag{3.40}$$

where $C \in \mathbb{R}$ is determined by using an initial state condition $x(0) = x_0 \in \mathbb{R}$.
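The coefficient comparison of this example can be reproduced numerically. The following sketch assumes the quadratic ansatz (3.37) and the symmetric minimizer (3.36); the function names are introduced here for illustration only. It checks that the bracket of (3.35) agrees with the polynomial (3.38) and then solves the quadratic equation for $A$.

```python
import math


def hjb_residual(x, a, b):
    """Bracket of (3.35) at u_i = u_j = -(a x + b), for V_i = a/2 x^2 + b x + c."""
    dv = a * x + b  # dV_i/dx from (3.37)
    u = -dv         # minimizing control from (3.36), identical for both players
    return 0.5 * x * x + 0.5 * u * u + dv * (-x + 2.0 * u)


def hjb_polynomial(x, a, b):
    """Right-hand side of (3.38)."""
    return (-1.5 * a * a - a + 0.5) * x * x - (3.0 * a * b + b) * x - 1.5 * b * b


# Roots of 3 A^2 + 2 A - 1 = 0, i.e. the x^2 coefficient of (3.38) set to zero.
disc = math.sqrt(2.0 ** 2 + 4.0 * 3.0 * 1.0)
a_1, a_2 = (-2.0 - disc) / 6.0, (-2.0 + disc) / 6.0

# Closed loop x' = -x + u_1 + u_2 with u_i = -a_2 x for both players.
closed_loop_rate = -1.0 - 2.0 * a_2
print(a_1, a_2, closed_loop_rate)
```

The positive root yields the decay rate that appears in the exponent of (3.40), while the negative root would give a growing closed loop.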


3.6.3 Pareto Efficient Solutions 


In general, a dynamic game has various Pareto efficient solutions. The set of all of these solu- 
tions is called Pareto frontier. In the following, a theorem presenting necessary and sufficient 
conditions for Pareto efficient solutions is given. 


18 Here, the notion of symmetry of [Doc00, p. 106] is considered, meaning that all players (usually two) have the
same cost function $J_i$ and control space $U_i$. Furthermore, the system dynamics are symmetric with respect to
the players in the sense that the equation is unaffected if e.g. $u_1$ is interchanged with $u_2$.
Theorem 3.3 (Necessary and Sufficient Conditions for Pareto Efficient Solutions)
Let $\tau_i > 0$, for all $i \in \mathcal{P}$, satisfy

$$\sum_{i=1}^{N} \tau_i = 1. \tag{3.41}$$

Now consider an N-player differential game. If $\gamma^P = (\gamma_1^P, \ldots, \gamma_N^P)$ is such that

$$\begin{aligned}
\gamma^P \in\ & \arg\min_{\gamma} \sum_{i=1}^{N} \tau_i J_i(\gamma) \\
& \text{w.r.t.} \\
& \dot{x} = f(x(t), u_1(t), \ldots, u_N(t), t) \\
& x(0) = x_0
\end{aligned} \tag{3.42}$$

then $\gamma^P$ is a Pareto efficient solution (PES). Moreover, if the strategy spaces $\Gamma_i$ are convex
and $J_i$ are convex for all $i \in \mathcal{P}$, then for all Pareto efficient solutions $\gamma^P$ there exist $\tau_i$ such
that $\gamma^P$ solves the optimization problem (3.42).


Proof: 
The theorem can be found in [Eng05, Theorem 6.4]. The sufficiency result is proved in [Eng05, 
Lemma 6.1] while the necessity part is proved in [Eng05, Lemma 6.3]. □


The formulation of Theorem 3.3 as a dynamic optimization problem allows the use of the
minimum principle to solve for the PES. The solution can sometimes be given with $\tau_i$ as a
degree of freedom. Weighting parameters which fulfill (3.41) can also be chosen to find a
particular PES, e.g. with $\tau_i = 1/N$.


In the following, an example is given to illustrate the calculation of a PES. 


Example 3.3: 


Consider the differential game with two players from Example 3.1. In this example, we assume 
the players are able to build cooperative strategies such that their overall performance is 
increased. 


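We choose $\tau_1 = \tau$ and $\tau_2 = 1 - \tau$ and state the cost function

$$J_P = \tau J_1 + (1 - \tau) J_2 \tag{3.43}$$

$$\phantom{J_P} = \int_0^T \left( \frac{1}{2}\tau u_1^2 + \frac{1}{2}(1 - \tau) u_2^2 + \frac{1}{2}x^2 \right) dt. \tag{3.44}$$
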
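We can now use the minimum principle to determine the solution. The Hamiltonian which
corresponds to $J_P$ is given by

$$H_P = \frac{\tau}{2}u_1^2 + \frac{1 - \tau}{2}u_2^2 + \frac{1}{2}x^2 + \psi_P(-x + u_1 + u_2). \tag{3.45}$$

Since there is a coordination between both players, we consider the vector $u_P = [u_1\ u_2]^\top$ as
the overall control vector. The control equation

$$\frac{\partial H_P}{\partial u_P} = \begin{bmatrix} \tau u_1 + \psi_P \\ u_2(1 - \tau) + \psi_P \end{bmatrix} = 0 \tag{3.46}$$

of the minimum principle leads to

$$u_1 = -\frac{1}{\tau}\psi_P \quad \text{and} \quad u_2 = -\frac{1}{1 - \tau}\psi_P. \tag{3.47}$$

Furthermore, the costate equation

$$\dot{\psi}_P = -\frac{\partial H_P}{\partial x} = -x + \psi_P \tag{3.48}$$

and the system dynamics equation

$$\dot{x} = -x + u_1 + u_2 \tag{3.49}$$

must hold for the optimal solution.


Similar to Example 3.1, by inserting (3.47) into (3.49) and using (3.48), we obtain a system of
differential equations

$$\begin{bmatrix} \dot{x} \\ \dot{\psi}_P \end{bmatrix} = \begin{bmatrix} -1 & -\frac{1}{\tau(1 - \tau)} \\ -1 & 1 \end{bmatrix} \begin{bmatrix} x \\ \psi_P \end{bmatrix}, \tag{3.50}$$

which can be solved analytically using the eigenvalue and eigenvector method. The general
solution is

$$x^*(t) = C_1(\lambda + 1)\exp(-\lambda t) - C_2(\lambda - 1)\exp(\lambda t) \tag{3.51}$$

$$\psi_P(t) = C_1\exp(-\lambda t) + C_2\exp(\lambda t), \tag{3.52}$$

with

$$\lambda = \sqrt{1 + \frac{1}{\tau(1 - \tau)}} \tag{3.53}$$

and where $C_l \in \mathbb{R}$, $l \in \{1,2\}$, are determined using initial and terminal conditions.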
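Varying the weight $\tau$ traces out different Pareto efficient solutions of this example. The sketch below evaluates the individual costs (3.19) along the decaying branch of (3.51) and (3.52), i.e. it assumes $C_2 = 0$ (a vanishing costate for $t \to \infty$) and the initial state $x(0) = 2$; both the helper names and the chosen grid of weights are illustrative assumptions.

```python
import math


def pareto_rate(tau):
    """Decay rate lambda of (3.53) for a weight tau in (0, 1)."""
    return math.sqrt(1.0 + 1.0 / (tau * (1.0 - tau)))


def pareto_costs(tau, x0):
    """Individual costs (J_1, J_2) on the decaying branch (C_2 = 0) of (3.51)-(3.52)."""
    lam = pareto_rate(tau)
    c1 = x0 / (lam + 1.0)     # from x(0) = C_1 (lambda + 1)
    u1_0 = -c1 / tau          # (3.47) evaluated at t = 0
    u2_0 = -c1 / (1.0 - tau)
    # x, u_1 and u_2 all decay like exp(-lambda t), so the integral of s(t)^2
    # over [0, inf) equals s(0)^2 / (2 lambda).
    j1 = 0.5 * (x0 ** 2 + u1_0 ** 2) / (2.0 * lam)
    j2 = 0.5 * (x0 ** 2 + u2_0 ** 2) / (2.0 * lam)
    return j1, j2


frontier = {tau: pareto_costs(tau, x0=2.0) for tau in (0.2, 0.35, 0.5, 0.65, 0.8)}
for tau, (j1, j2) in frontier.items():
    print(f"tau = {tau:.2f}: J_1 = {j1:.3f}, J_2 = {j2:.3f}")
```

A larger weight on a player's cost lowers that player's cost at the expense of the other, and the symmetric choice $\tau = 0.5$ gives both players the same value.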
3.6.4 Comparison of Solution Concepts


In general, the OLNE and FNE are not equal since they are based on different assumptions
concerning the information available to the players. Furthermore, while there are some cases
where Nash equilibria and Pareto efficient solutions coincide, this is also generally not the
case. In order to illustrate the difference between the solutions, the following example is
presented.


Example 3.4: 

Consider the same two-player differential game as in Examples 3.1, 3.2 and 3.3. In the three
examples, the OLNE, FNE and the PES were calculated, respectively. In this example, the exact
trajectories which follow from $\tau = 0.5$ and the boundary conditions

$$x(0) = 2, \quad \psi_1(T \to \infty) = 0, \quad \psi_2(T \to \infty) = 0 \quad \text{and} \quad \psi_P(T \to \infty) = 0 \tag{3.54}$$

were determined analytically using MATLAB's dsolve. Figure 3.3 shows state and control
trajectories of the differential game defined by (3.18) and (3.19). Only one control trajectory
is shown for each solution concept since the symmetry of the game leads to equal controls for
both players. While the OLNE and FNE are similar to each other, the PES differs considerably
more.


[Figure 3.3: curves for OLNE, FNE and PES; control axis $u_i(t)$, time axis t in s]

Figure 3.3: Open-loop Nash equilibrium, feedback Nash equilibrium and Pareto efficient solution of an example
two-player differential game: a) State trajectories, b) Control trajectories.
Finally, in order to show that a cooperative differential game with a PES leads to a better
outcome than a non-cooperative setting, we calculate the value of the objective function for
each solution concept:

$$J_{i,\mathrm{OLNE}} = 0.655 \tag{3.55}$$

$$J_{i,\mathrm{FNE}} = 0.667 \tag{3.56}$$

$$J_{i,\mathrm{PES}} = 0.618, \quad i \in \{1,2\}. \tag{3.57}$$

Hence, $J_{i,\mathrm{FNE}} \ge J_{i,\mathrm{OLNE}} \ge J_{i,\mathrm{PES}}$ holds. The lower cost of the PES demonstrates an advantage
of acting cooperatively in this example.
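The three cost values can be recomputed in closed form, because under the boundary conditions (3.54) the state and the control of every solution concept decay with a single exponential rate. The sketch below is not the MATLAB computation used for the figure; it is a plain re-derivation under that observation, with the helper `quadratic_cost` introduced here for illustration.

```python
import math

X0 = 2.0  # initial state from (3.54)


def quadratic_cost(x0, u0, rate):
    """Integral over [0, inf) of (x^2 + u^2) / 2 when x and u both decay like exp(-rate t)."""
    return 0.5 * (x0 ** 2 + u0 ** 2) / (2.0 * rate)


# OLNE: vanishing costates remove the growing modes, x(0) = C_1 (sqrt(3) + 1), u_i = -psi_i.
c1_olne = X0 / (math.sqrt(3.0) + 1.0)
j_olne = quadratic_cost(X0, -c1_olne, math.sqrt(3.0))

# FNE: feedback u_i = -x / 3 from (3.39), closed-loop rate 5 / 3 from (3.40).
j_fne = quadratic_cost(X0, -X0 / 3.0, 5.0 / 3.0)

# PES with tau = 0.5: lambda = sqrt(5), x(0) = C_1 (lambda + 1), u_i = -psi_P / tau.
lam = math.sqrt(5.0)
c1_pes = X0 / (lam + 1.0)
j_pes = quadratic_cost(X0, -c1_pes / 0.5, lam)

print(round(j_olne, 3), round(j_fne, 3), round(j_pes, 3))
```

The FNE value also equals the value function of Example 3.2 evaluated at the initial state, $V_i(2) = \frac{1}{6} \cdot 2^2$.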


3.7 Tractable Differential Games 


The solution of the coupled differential equations which arise from the necessary and sufficient
conditions for Nash equilibria is in general not a trivial task, especially concerning the
partial differential equations (HJB equations) which are needed to find an FNE. Indeed, finding
Nash equilibria for general differential games is a subject of current research.
To find an FNE in nonlinear differential games, approximate or iterative solutions of the
HJB equations are sought and therefore, reinforcement learning and adaptive dynamic
programming techniques are attracting increasing interest [KKD14, ZZWZ16, KVML18].


There are particular kinds of differential games which are similar to the examples presented 
in the previous subsections in the sense that the calculation of Nash equilibria is considerably 
simplified. These are therefore called tractable differential games [HKZ12, Section 7.6] and 
include 


- linear-quadratic differential games 
- linear-state differential games 


- exponential differential games. 


These kinds of differential games are treated e.g. in [DFJ85] and [Doc00, Chapter 7]. 


One of the structures considered in this thesis is the linear-quadratic differential game, as it is
an important and widespread class of differential games which has been used in several applications
of automatic control including driver assistance systems [FFH17], collision avoidance
[MSA17], control of mobile robots [Gu08] and control of energy grids [ZMSFZ16]. Therefore,
the following section presents the most important results which are known for this particular
class of games.
3.8 Linear-Quadratic Differential Games


A linear-quadratic (LQ) differential game is a class of differential games where the system
the players control simultaneously has linear dynamics, i.e. the evolution of the states is
governed by a system of linear differential equations. Furthermore, the players act based
upon an individual quadratic cost function. This kind of game can therefore be seen as an
extension of linear-quadratic optimal control to the N-player case. LQ differential games are
considered a class of differential games which can be solved with reasonable effort. Their 
particular structure allows the derivation of necessary and sufficient conditions for Nash 
equilibria which are computationally tractable. 


Definition 3.11 (Linear-Quadratic Differential Game) 


A linear-quadratic (LQ) differential game is defined by the same elements as Definition 3.3.
The system dynamics are linear, i.e. are defined by

$$\dot{x}(t) = Ax(t) + \sum_{i=1}^{N} B_i u_i(t), \tag{3.58}$$

where $x(t) \in \mathbb{R}^n$, $u_i(t) \in \mathbb{R}^{m_i}$ and $A$ and $B_i$, $i \in \mathcal{P}$, are the system and control matrices of
appropriate dimensions, respectively, which form stabilizable matrix pairs $(A, B_i)$, $i \in \mathcal{P}$.
Furthermore, the cost functions are quadratic, i.e.

$$J_i = \frac{1}{2}x^\top(T) Q_{i,T}\, x(T) + \frac{1}{2}\int_0^T \left( x^\top(t) Q_i x(t) + \sum_{j=1}^{N} u_j^\top(t) R_{ij} u_j(t) \right) dt, \tag{3.59}$$

where $Q_{i,T}$, $Q_i$, $R_{ij}$ are symmetric matrices for all $i, j \in \mathcal{P}$ and $R_{ii} \succ 0$.


The constraint of positive definiteness $R_{ii} \succ 0$ is required in order to guarantee a meaningful
minimization problem. Additional positive-semidefiniteness constraints are sometimes
introduced, e.g. $Q_{i,T}, Q_i \succeq 0$. These are often convenient to obtain Nash equilibrium solutions
but are not always strictly necessary, as will be discussed in the next subsection.¹⁹
Furthermore, the stabilizable pairs $(A, B_i)$, $i \in \mathcal{P}$, imply that each player is able to stabilize
the system on its own, a fact that is required for the following results on Nash equilibria in
LQ differential games.
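As a brief illustration, the two-player game of Examples 3.1 to 3.4, i.e. the dynamics (3.18) with the cost functions (3.19), can be read as a scalar instance of Definition 3.11. The identification below is inferred from those two equations; in particular, the zero cross-weights and the zero terminal weights are assumptions based on the absence of such terms in (3.19).

```latex
\begin{align*}
  n &= 1, \quad N = 2, \quad A = -1, \quad B_1 = B_2 = 1, \\
  Q_1 &= Q_2 = 1, \quad R_{11} = R_{22} = 1, \quad R_{12} = R_{21} = 0, \quad
  Q_{1,T} = Q_{2,T} = 0 .
\end{align*}
```

With this data both pairs $(A, B_i) = (-1, 1)$ are stabilizable, since $A$ is already stable.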


19 A widespread case is given by a two-player differential game ($N = 2$) where the players act in a strictly
adversarial way. This is represented by cost function matrices $Q_2 = -Q_1$, $Q_{2,T} = -Q_{1,T}$, $R_{12} = -R_{22}$,
$R_{21} = -R_{11}$ and is known as a zero-sum differential game [SH69b].
3.8.1 Nash Equilibria in Open-Loop LQ Differential Games
Finite-Horizon 


Consider a linear-quadratic differential game with finite horizon T. The calculation of open- 
loop Nash equilibria is based on the solution of coupled matrix Riccati differential equations 
(RDEs), which can be derived from Pontryagin’s minimum principle. Therefore, applying 
Theorem 3.1 to LQ differential games leads to the following result. 


Theorem 3.4 (Sufficient Conditions for OLNE solutions in Finite-Horizon LQ
Differential Games)


Consider an N-player LQ differential game as in Definition 3.11 with the additional constraints
$Q_i, Q_{i,T} \succeq 0$, $i \in \mathcal{P}$. Let there exist a set of matrix-valued functions $P_i$, $i \in \mathcal{P}$,
which satisfy the Riccati differential equations (RDEs)

$$\dot{P}_i(t) = -P_i(t)A - A^\top P_i(t) + \sum_{j=1}^{N} P_i(t) B_j R_{jj}^{-1} B_j^\top P_j(t) - Q_i, \quad i \in \mathcal{P}, \tag{3.60}$$

with the transversality conditions

$$P_i(T) = Q_{i,T}, \quad i \in \mathcal{P}. \tag{3.61}$$

Then, the LQ differential game has a unique OLNE for every initial state $x_0$. Moreover, the
resulting N-tuple of equilibrium controls $u^*$ is defined by the controls

$$u_i^*(t) = \gamma_i^*(x_0, t) = -R_{ii}^{-1} B_i^\top P_i(t) \Phi(t, 0) x_0, \quad i \in \mathcal{P}. \tag{3.62}$$

Here, $\Phi(t, 0)$ satisfies the differential equation

$$\dot{\Phi}(t, 0) = \left( A - \sum_{j=1}^{N} S_j P_j(t) \right) \Phi(t, 0), \quad \Phi(t, t) = I, \tag{3.63}$$

where

$$S_j = B_j R_{jj}^{-1} B_j^\top, \quad j \in \mathcal{P}. \tag{3.64}$$

Proof:
See Section B.1 of the Appendix. □
Theorem 3.4 gives an approach for calculating Nash equilibria by solving the RDEs (3.60)
with the conditions (3.61). Nevertheless, cases exist where these do not have a solution, but
the LQ differential game still has a solution [BO99, p. 314].


In case the system is not affected by any disturbance during the complete game duration, the
controls can be formulated in the form of an optimal feedback law

$$\gamma_i^*(x, t) = -R_{ii}^{-1} B_i^\top P_i(t) x(t), \quad i \in \mathcal{P}. \tag{3.65}$$
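For scalar games the coupled RDEs (3.60) can be integrated backward from the transversality condition (3.61) with a few lines of code. The sketch below uses a classical fourth-order Runge-Kutta scheme and, as an assumption, the scalar data of the two-player example game ($A = -1$, $S_1 = S_2 = 1$, $Q_1 = Q_2 = 1$) with zero terminal weights; horizon and step count are arbitrary illustrative choices.

```python
import math


def olne_rde_rhs(p, a, s, q):
    """Right-hand side of the scalar version of (3.60): dP_i/dt for all players."""
    coupling = sum(s_j * p_j for s_j, p_j in zip(s, p))  # sum_j S_j P_j
    return [-2.0 * a * p_i + p_i * coupling - q_i for p_i, q_i in zip(p, q)]


def integrate_backward(p_terminal, horizon, steps, a, s, q):
    """Classical RK4 from t = T down to t = 0; returns the list of P_i(0)."""
    h = -horizon / steps  # negative step: integration runs backward in time
    p = list(p_terminal)
    for _ in range(steps):
        k1 = olne_rde_rhs(p, a, s, q)
        k2 = olne_rde_rhs([x + 0.5 * h * k for x, k in zip(p, k1)], a, s, q)
        k3 = olne_rde_rhs([x + 0.5 * h * k for x, k in zip(p, k2)], a, s, q)
        k4 = olne_rde_rhs([x + h * k for x, k in zip(p, k3)], a, s, q)
        p = [x + h / 6.0 * (c1 + 2.0 * c2 + 2.0 * c3 + c4)
             for x, c1, c2, c3, c4 in zip(p, k1, k2, k3, k4)]
    return p


# Assumed scalar data read off (3.18) and (3.19); the zero terminal weight is an assumption.
a, s, q = -1.0, [1.0, 1.0], [1.0, 1.0]
p_at_zero = integrate_backward([0.0, 0.0], horizon=10.0, steps=2000, a=a, s=s, q=q)
closed_loop = a - sum(s_j * p_j for s_j, p_j in zip(s, p_at_zero))

# Positive root of the stationary equation 0 = 2 P + 2 P^2 - 1 for comparison.
p_stationary = (math.sqrt(3.0) - 1.0) / 2.0
print(p_at_zero, p_stationary, closed_loop)
```

For this long horizon $P_i(0)$ is numerically indistinguishable from the stationary value, and the resulting closed-loop coefficient $-\sqrt{3}$ matches the decay rate of the open-loop solution in Example 3.1.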


Infinite-Horizon 


In the infinite-horizon case, i.e. $T \to \infty$, the matrices $P_i$ are constant ($\dot{P}_i = 0$), resulting in
coupled algebraic Riccati equations (AREs) and leading to the following result.


Theorem 3.5 (Sufficient Conditions for OLNE solutions in Infinite-Horizon LQ
Differential Games)

Consider an N-player LQ differential game as in Definition 3.11 with $T \to \infty$ and with the
additional constraints $Q_i \succ 0$ and $Q_{i,T} = 0$, $i \in \mathcal{P}$. Then, the LQ differential game has
an OLNE for every initial state $x_0$ if a set of matrices $P_i$, $i \in \mathcal{P}$, exists which satisfies the
algebraic Riccati equations (AREs)

$$0 = -P_i A - A^\top P_i + \sum_{j=1}^{N} P_i B_j R_{jj}^{-1} B_j^\top P_j - Q_i, \quad i \in \mathcal{P} \tag{3.66}$$

and additionally leads to a stable closed-loop system²⁰

$$F = A - \sum_{j=1}^{N} S_j P_j, \tag{3.67}$$

i.e. the eigenvalues of $F$ have negative real parts. The resulting N-tuple of Nash equilibrium
controls $u^*$ is defined by (3.62), where $P_i(t) = P_i$, $i \in \mathcal{P}$.


Proof:
See the proof of [BO99, Theorem 6.22]. □


According to [BO99, p. 336], the existence of OLNEs in an infinite-horizon LQ differential
game does not imply the existence of an OLNE in the finite-horizon version of the game.
Moreover, a unique solution of the RDEs in a finite-horizon differential game may converge
for $T \to \infty$ to a solution of the coupled AREs, but this limit is not necessarily a stabilizing
solution and therefore would not constitute an OLNE of the infinite-horizon differential game.


20 Note that the stabilizability of $(A, [B_1, \ldots, B_N])$ is necessary, a property which follows from the stabilizable
pairs $(A, B_i)$, $i \in \mathcal{P}$, according to Definition 3.11.


3.8.2 Nash Equilibrium in Feedback LQ Differential Games 
Finite Horizon 


Consider an LQ differential game with finite horizon $T$. Similar to the open-loop case, the
calculation of feedback Nash equilibria is based on the solution of coupled RDEs, which can
be derived from Theorem 3.2. We shall now restrict our attention to linear feedback strategies
belonging to the set

$$\Gamma_i = \{ \gamma_i \mid \gamma_i(x, t) = -K_i(t) x(t) \}. \tag{3.68}$$


This allows the formulation of the following theorem. 


Theorem 3.6 (Necessary and Sufficient Conditions for FNE solutions in
Finite-Horizon LQ Differential Games)


Consider an N-player LQ differential game as in Definition 3.11. The LQ differential game
has a linear FNE for every initial state $x_0$ if and only if a set of symmetric matrix-valued
functions $P_i$, $i \in \mathcal{P}$, exists which satisfy the Riccati differential equations (RDEs)

$$\dot{P}_i(t) = -Q_i - P_i(t)A - A^\top P_i(t) + \sum_{j=1}^{N} P_i(t) S_j P_j(t) + \sum_{\substack{j=1 \\ j \neq i}}^{N} P_j(t) S_j P_i(t) - \sum_{\substack{j=1 \\ j \neq i}}^{N} P_j(t) S_{ij} P_j(t), \tag{3.69}$$

where

$$\begin{aligned}
S_j &= B_j R_{jj}^{-1} B_j^\top, & j &\in \mathcal{P}, \\
S_{ij} &= B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^\top, & i, j &\in \mathcal{P},\ i \neq j,
\end{aligned} \tag{3.70}$$

and the transversality conditions

$$P_i(T) = Q_{i,T}, \quad i \in \mathcal{P}. \tag{3.71}$$

The resulting N-tuple of linear Nash equilibrium strategies $\gamma^*$ is unique and defined by

$$\gamma_i^*(x, t) = -R_{ii}^{-1} B_i^\top P_i(t) x(t) =: -K_i(t) x(t), \quad i \in \mathcal{P}. \tag{3.72}$$
Proof:
See the proof of [Eng05, Theorem 8.3]. □


Generally speaking, the FNE arising from the solution of the coupled RDEs is not necessarily
the only one. Başar reported in [Bas74] the existence of equilibrium strategies which are
nonlinear functions of the state in discrete-time linear-quadratic dynamic games. Similarly,
in [TM90] the authors present a specific LQ differential game example for which a nonlinear
FNE exists. Therefore, Theorem 3.6 may not apply if the strategy space is enlarged so as to
include nonlinear strategies [Eng05, p. 365].


Infinite Horizon 


As in the finite-horizon case, we restrict our attention to linear feedback strategies. Nevertheless,
for infinite-horizon games, these are constant over time, i.e. they are defined by the
set

$$\Gamma_i = \{ \gamma_i \mid \gamma_i(x, t) = -K_i x(t) \}. \tag{3.73}$$

Furthermore, these strategies (or alternatively, control laws) $K = (K_1, \ldots, K_N)$ are assumed
to belong to the set

$$\mathcal{F} = \{ (K_1, \ldots, K_N) \mid F \text{ is stable} \}, \tag{3.74}$$

which can be interpreted as the players striving to jointly stabilize the system.²¹ A
necessary and sufficient condition for the non-emptiness of $\mathcal{F}$ is the stabilizability of the
matrix pair $(A, [B_1 \cdots B_N])$ [EBS00]. With these conditions in mind, the following result
is stated.


Theorem 3.7 (Necessary and Sufficient Conditions for FNE solutions in Infinite-
Horizon LQ Differential Games)


Consider an N-player LQ differential game as in Definition 3.11 with $T \to \infty$. Let the
matrices $P_i$, $i \in \mathcal{P}$, be symmetric solutions to the AREs

$$0 = -Q_i - P_i A - A^\top P_i + \sum_{j=1}^{N} P_i S_j P_j + \sum_{\substack{j=1 \\ j \neq i}}^{N} P_j S_j P_i - \sum_{\substack{j=1 \\ j \neq i}}^{N} P_j S_{ij} P_j, \tag{3.75}$$

and additionally lead to a stable closed-loop system

$$F = A - \sum_{j=1}^{N} S_j P_j,$$

where

$$\begin{aligned}
S_j &= B_j R_{jj}^{-1} B_j^\top, & j &\in \mathcal{P}, \\
S_{ij} &= B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^\top, & i, j &\in \mathcal{P},\ i \neq j.
\end{aligned} \tag{3.76}$$

Then, there exists a linear FNE and the corresponding feedback strategies are defined by

$$u_i^*(t) = \gamma_i^*(x, t) = -R_{ii}^{-1} B_i^\top P_i x(t) = -K_i x(t). \tag{3.77}$$

Conversely, if a linear FNE exists and is defined by (3.77), then there exists a set of stabilizing
matrices $P_i$, $i \in \mathcal{P}$, which solve the AREs (3.75).


21 According to [Eng05, p. 372], this corresponds to the supposition that both players have a first priority in
stabilizing the system. Furthermore, for most games the equilibria without this stabilization constraint coincide with
the ones corresponding to a game for which this constraint is included. Therefore, the stabilization constraint
will not be active in most cases.


Proof: 

Since $(A, [B_1 \cdots B_N])$ is stabilizable, which follows from the fact that the single pairs $(A, B_i)$,
$i \in \mathcal{P}$, are stabilizable according to Definition 3.11, the rest of the proof is stated in [Eng05,
Theorem 8.5]. □


Theorem 3.7 was formulated with some freedom, as the results of the infinite-horizon case are
established with a definition of a feedback Nash equilibrium specific to infinite-horizon
LQ games which is based on the constant linear feedback strategies (3.73). Further details
are given in Chapter 5, where the AREs are exploited to develop a method for inverse LQ 
dynamic games. In addition, it is worth noting that the solutions of the AREs (3.75) and 
therefore the FNE are generally not unique [Eng05, p. 381]. 
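For scalar games the coupled AREs (3.75) and the stability requirement on $F$ can be checked directly for candidate solutions. The sketch below assumes the scalar data of the two-player example game ($A = -1$, $S_1 = S_2 = 1$, $Q_1 = Q_2 = 1$, no cross-weights, hence $S_{ij} = 0$); the function names and the two candidate pairs are introduced here for illustration.

```python
def fne_are_residuals(p, a, s, s_cross, q):
    """Residuals of the scalar form of the coupled AREs (3.75) for all players."""
    n = len(p)
    residuals = []
    for i in range(n):
        own = sum(p[i] * s[j] * p[j] for j in range(n))               # sum_j P_i S_j P_j
        mixed = sum(p[j] * s[j] * p[i] for j in range(n) if j != i)   # sum_{j != i} P_j S_j P_i
        cross = sum(p[j] * s_cross[i][j] * p[j] for j in range(n) if j != i)
        residuals.append(-q[i] - 2.0 * a * p[i] + own + mixed - cross)
    return residuals


def closed_loop(p, a, s):
    """Scalar closed-loop coefficient F = A - sum_j S_j P_j."""
    return a - sum(s_j * p_j for s_j, p_j in zip(s, p))


# Assumed scalar data read off (3.18) and (3.19).
a, s, q = -1.0, [1.0, 1.0], [1.0, 1.0]
s_cross = [[0.0, 0.0], [0.0, 0.0]]  # no cross-weights R_ij appear in (3.19)

candidates = {
    "stabilizing": [1.0 / 3.0, 1.0 / 3.0],
    "non-stabilizing": [-1.0, -1.0],
}
report = {
    name: (max(abs(r) for r in fne_are_residuals(p, a, s, s_cross, q)), closed_loop(p, a, s))
    for name, p in candidates.items()
}
for name, (residual, f_value) in report.items():
    print(f"{name}: max residual = {residual:.1e}, F = {f_value:.3f}")
```

Both pairs solve the AREs, but only the first one yields a stable $F$, which is why the stability condition is part of the theorem. In this particular scalar game only one stabilizing pair exists; the non-uniqueness mentioned above concerns the general case.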


3.9 Summary 


This chapter presented fundamentals of dynamic game theory needed for the understand- 
ing of the inverse dynamic game methods introduced in this thesis. The following chapters 
are all based on games with the basic properties presented in Definition 3.3 and mainly use
the Nash equilibrium as a solution concept. Nevertheless, a possible application to dynamic
games with Pareto efficient solutions is additionally mentioned. Inverse dynamic game
problems depend on further characteristics of the game, e.g. the information structure and 
strategy types as well as the assumed class of dynamic systems and cost function structure. 
The following three chapters introduce different kinds of inverse dynamic games and corre- 
sponding methods for their solution. 


4 Inverse Non-Cooperative Differential Games


This chapter presents results on the solution of inverse differential games.²² As described
in Chapter 2, the aim of an inverse differential game is to calculate the cost functions which
the players minimized and which gave rise to observed state and control trajectories. In the
following, this problem is first stated formally. Afterwards, the main contributions presented in
this chapter are the proposal of an efficient method for solving inverse open-loop differential
games and the formulation of sufficient conditions for the uniqueness of the solution.
Furthermore, the applicability of the method for inverse differential games with feedback
strategies is demonstrated.²³


4.1 Problem Formulation 


The theoretical framework of non-cooperative differential games describes N agents treated 
as entities controlling the system based on the minimization of their individual cost functions, 
as introduced in Chapter 3. The non-cooperative nature of the game means that no contracts 
or agreements between players are in place while attempting to minimize their individual 
costs. Within the inverse problem of differential games, the result of the interaction between 
all players, i.e. the state and control trajectories, are assumed as given. A further important 
characteristic of the inverse differential game is that the interaction led to a Nash equilib- 
rium. Some work exists which investigates conditions under which Nash equilibria exist (see 
e.g. the results in [Luk71, Var70] and the discussions and references given in [BO99, Eng05]). 
However, these conditions are not general and not simple to formulate in terms of the sys- 
tem dynamics or the cost functions. Addressing the existence of Nash equilibria in general 
dynamic games is beyond the scope of this thesis and therefore, the following assumption 
will be made. 


22 In the remainder of this thesis, the term inverse differential game describes an inverse dynamic game in
continuous time (cf. last paragraph of Section 3.1).
23 The results of this chapter are based on the conference paper [RIK+17] and the author's contribution to the
journal paper [MIF+20].
Assumption 4.1 (Nash Character of the Observed Trajectories)


The observed state trajectories $x(t)$ and control trajectories $(u_1(t), \ldots, u_N(t))$ of all players
are Nash equilibrium trajectories $x^*(t)$ and $u_i^*(t)$ generated by a non-cooperative differential
game defined by a set of non-trivial cost functions $J^* = \{J_1^*, \ldots, J_N^*\}$ and a dynamic system
according to Definition 3.2.


With this assumption, the inverse differential game problem is defined as follows. 


Definition 4.1 (Inverse Differential Game Problem) 


Let Assumption 4.1 hold such that state trajectories $x^*(t)$ and control trajectories $u_i^*(t)$,
$\forall i \in \mathcal{P}$, which correspond to a Nash equilibrium, are given. Find at least one set $J$ such
that $J_i$, $\forall i \in \mathcal{P}$, fulfill

$$\begin{aligned}
u_i^*(t) =\ & \arg\min_{u_i(t)} J_i(x^*(t), u_i(t), u_{\neg i}^*(t)) \\
& \text{w.r.t.} \\
& \dot{x}(t) = f(x(t), u_1(t), \ldots, u_N(t), t) \\
& x(0) = x_0.
\end{aligned} \tag{4.1}$$


The formulation of the inverse differential game problem implies determining the cost functions
$J_i$, $i \in \mathcal{P}$, such that $u_i^*(t)$ solves the optimal control problems (4.1) which follow
from Definition 3.7. Definition 4.1 allows for several types of Nash equilibria which arise depending
on the information structure of the game and the resulting strategy types. In particular, open-loop
and feedback Nash equilibria are considered in this thesis. In addition, Definition 4.1 asks for
"at least one set" of cost functions because inverse problems in optimal control and dynamic games
are ill-posed: several sets of cost functions exist which are equivalent in the sense that all of
them are able to explain the same state and control trajectories. The concept of equivalence of cost
functions is formalized in Section B.2 of the Appendix.


The inverse differential game of Definition 4.1 is very general and represents a considerably 
complex task since there is an infinite number of possible cost functions varying in structure 
and parametrization which may potentially solve the inverse differential game problem. This 
issue is not unique to inverse dynamic or differential games as it also arises in the inverse 
problem of optimal control (single-player case). Therefore, parameters need to be introduced 
first. Two lines of research have been developed to achieve this. 


• Approximation of non-linear cost function structures by means of Gaussian processes
[LPK11, LHF14] or, alternatively, artificial neural networks [WOP16].
• Setting the cost function structure as a linear combination of basis functions [MTL10,
PJJB12, JAB13, AB14, MTFP16, PRBF18, JKL* 19].


The first approach utilizes parameterized kernel functions which determine the structure of
the Gaussian process. In this way, non-linear rewards can be learned by maximizing the likelihood
function of the Gaussian process regression output and the kernel parameters under
known observations of the state and control values. Nevertheless, finding these parameters
is a computationally complex task which has only been solved successfully in discretized state
and control spaces (e.g. a grid world). On the other hand, the use of artificial neural networks
usually demands large data sets and long computation times.


Therefore, the second approach is followed and presented in the following subsection.

4.2 Basis Functions Approach


In this approach, the cost functions are given a structure specified with basis functions which 
are defined as follows. 


Definition 4.2 (Basis Functions Vector)

The vector $\phi_i \in \mathbb{R}^{M_i}$ contains the non-trivial functions
$\phi_{i,(j)}(x(t), u_1(t), \dots, u_N(t), t)$, $j \in \{1, \dots, M_i\}$, which are called basis functions.
Furthermore, the functions $\phi_{i,(j)} : \mathbb{R}^n \times \mathbb{R}^{m_1} \times \dots \times
\mathbb{R}^{m_N} \times [0,T] \to \mathbb{R}$ are continuously differentiable in $x$ and
$u_1, \dots, u_N$ for all $j \in \{1, \dots, M_i\}$.


The notation $a_{i,(j)}$ is used here and in the remainder of this thesis to represent the $j$-th entry
of any vector $a$ which corresponds to player $i \in \mathcal{P}$.


Based on Definition 4.2, cost functions which consist of a linear combination of the basis
functions are introduced, i.e.

\[
J_i(\phi_i, \theta_i) = \int_0^T \theta_i^\top \phi_i\big(x(t), u_1(t), \dots, u_N(t), t\big) \, dt, \tag{4.2}
\]

where $\theta_i \in \Theta_i \subset \mathbb{R}^{M_i}$ are time-invariant parameters. The introduction of
basis functions may appear stringent, yet it allows a wide variety of possible cost function structures.^24
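
The linear-in-parameters structure of (4.2) can be illustrated numerically. The following is a minimal sketch, assuming a single player with the hypothetical basis vector [u^2, x^2] and an illustrative (not optimal) trajectory x(t) = exp(-t), u(t) = -exp(-t); the function names `basis` and `cost` are chosen here for illustration and do not stem from the thesis.

```python
import math

def basis(x, u):
    """Hypothetical basis vector phi(x, u) = [u^2, x^2] (an assumption for this sketch)."""
    return [u * u, x * x]

def cost(theta, ts, xs, us):
    """Trapezoidal approximation of J = int_0^T theta^T phi(x(t), u(t)) dt, cf. (4.2)."""
    # Running cost theta^T phi at every sample time
    running = [sum(th * ph for th, ph in zip(theta, basis(x, u)))
               for x, u in zip(xs, us)]
    total = 0.0
    for k in range(len(ts) - 1):
        total += 0.5 * (running[k] + running[k + 1]) * (ts[k + 1] - ts[k])
    return total

# Illustrative sampled trajectory on [0, 1]
n = 2001
ts = [k / (n - 1) for k in range(n)]
xs = [math.exp(-t) for t in ts]
us = [-math.exp(-t) for t in ts]

J = cost([1.0, 3.0], ts, xs, us)
J_scaled = cost([2.0, 6.0], ts, xs, us)   # scaling theta scales J by the same factor
```

Because the cost is linear in the parameters, doubling the parameter vector doubles the cost value without changing which trajectory minimizes it, which is the scaling ambiguity that any normalization of the parameter set has to remove.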


24 Although the considered cost functions (4.2) have a so-called Lagrangian structure, i.e. cost functions with only
integral costs, the methods and results of this chapter are also applicable to games with player cost functions
with a Bolza structure, i.e. of the form (3.3). To do so, the terminal costs $h_i(x(T), T)$ must be written as a linear
combination of basis functions as well. Afterwards, the Bolza cost function can be transformed into a Lagrange
cost function by means of the fundamental theorem of calculus (see e.g. [Nai03, Section 2.7.1]).

In order to define a well-posed inverse differential game problem with the newly introduced basis
functions, the dynamics $f$ and basis functions $\phi_i$ should be specified such that the observed
states $x(t)$ and controls $(u_1(t), \dots, u_N(t))$ constitute a Nash equilibrium solution to the dynamic
game for some (possibly non-unique) cost-functional parameters $\theta_i \in \Theta_i$. Addressing
the selection of suitable dynamics and basis functions is beyond the scope of this thesis.
Therefore, the following assumption is introduced:


Assumption 4.2 (Nash Character of the Trajectories w.r.t. a Differential Game
with Basis Functions)


The observed states $x(t)$ and controls $(u_1(t), \dots, u_N(t))$ constitute a Nash equilibrium solution
to the differential game with system dynamics according to Definition 3.2 which are
additionally continuously differentiable in $x$ and $u_1, \dots, u_N$, and cost functions of the form
(4.2) consisting of basis functions $\phi_i$ according to Definition 4.2 and the unknown cost function
parameters $\theta_i = \theta_i^* \in \Theta_i$ for $i \in \mathcal{P}$.


Assumption 4.2 specializes Assumption 4.1 to the cost function structure established
with the basis functions of Definition 4.2. The assumption of continuous differentiability
of the system dynamics $f$ is standard and permits the use of Theorem 3.1,
which shall be leveraged in the course of this chapter. With this assumption, the
inverse differential game problem regarded in this chapter is defined as follows.


Definition 4.3 (Inverse Differential Game with Basis Functions) 


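Let Assumption 4.2 be fulfilled such that state trajectories $x^*(t)$ and control trajectories
$u_i^*(t)$, $i \in \mathcal{P}$, which correspond to a Nash equilibrium, are given. Determine at least one
tuple of parameters $\theta := (\theta_1, \dots, \theta_N)$, with $\theta_i \in \Theta_i$, $i \in \mathcal{P}$, such that

\[
\begin{aligned}
u_i^*(t) = \arg\min_{u_i(t)} \; & J_i\big(\phi_i(x^*(t), u_i(t), u_{\neg i}^*(t), t), \theta_i\big) \\
\text{w.r.t.} \quad & \dot{x}(t) = f\big(x(t), u_1(t), \dots, u_N(t), t\big) \\
& x(0) = x_0
\end{aligned}
\tag{4.3}
\]

for all players $i \in \mathcal{P}$.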


A consequence of the introduction of basis functions is the reduction of the general inverse
differential game problem to a parameter identification problem. Despite this simplification,
under Assumption 4.2, the inverse differential game problem will still have multiple solutions
in general. One of the reasons is the following: if the trajectories $x^*(t)$ and
$(u_1^*(t), \dots, u_N^*(t))$ solve the dynamic optimization problems of Definition 4.3 with
$\theta_i = \theta_i^* \in \Theta_i$, then the trajectories will also solve the dynamic optimization
problems with $\theta_i = c_i \theta_i^*$ for all scaling factors $c_i > 0$. Furthermore, the zero vectors
$\theta_i = 0$ are trivial solutions to the inverse differential game problem. Therefore, without loss
of generality, trivial solutions and ambiguous scaling shall be excluded by considering parameter sets
of the form $\Theta_i = \{\theta_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1\}$, where $\theta_{i,(1)}$
denotes the first element of $\theta_i$. The choice of the fixed-element constraint $\theta_{i,(1)} = 1$
is arbitrary, and results analogous to those of this chapter will also hold with normalization
constraints such as $\|\theta_i\| = 1$.^25
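
The effect of the fixed-element constraint can be sketched in a few lines: every positive multiple of a parameter vector is mapped to the same representative with first entry equal to one. The helper name `normalize_fixed_element` and the numerical values are assumptions made for this illustration only.

```python
def normalize_fixed_element(theta):
    """Rescale theta by c = 1 / theta[0] > 0 so that its first entry equals 1."""
    if theta[0] <= 0.0:
        # A non-positive first entry cannot be normalized with a scaling c > 0
        raise ValueError("first entry must be positive")
    return [th / theta[0] for th in theta]

# Two parameter vectors that differ only by a positive scaling factor
theta_a = normalize_fixed_element([2.0, 10.0, 4.0])
theta_b = normalize_fixed_element([0.5, 2.5, 1.0])
```

Both calls return the same representative, so the two scaled vectors are indistinguishable after normalization, while the zero vector is rejected.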


4.3 Inverse Open-Loop Differential Games 


The inverse differential games of Definitions 4.1 and 4.3 imply finding cost functions such that the
solutions of the $N$ optimal control problems correspond to the given controls $(u_1^*(t), \dots, u_N^*(t))$.
Since, for the optimal control problem of a particular player $i$, the other players’ controls
$u_{\neg i}^*(t)$ are available, we can proceed to analyze these individual optimal control problems. For the
forward problem of finding open-loop Nash equilibrium trajectories, the tools of optimal
control theory, in particular the minimum principle of Pontryagin, are leveraged to obtain
necessary conditions for open-loop Nash equilibria (cf. Section 3.6). Similarly, in this section,
these conditions shall be exploited to find parameters $\theta_i$ which solve the inverse differential
game problem of Definition 4.3 in case of open-loop strategies.


4.3.1 Residual-Based Approach 


The main idea consists of exploiting the fact that the observed trajectories correspond by
assumption to a Nash equilibrium, i.e. $x(t) = x^*(t)$ and $u_i(t) = u_i^*(t)$. These must fulfill
the equations of Theorem 3.1 as these represent necessary conditions for Nash equilibria.
Consider any player $i \in \mathcal{P}$. Besides the system dynamics equation, the costate equation

\[
\dot{\psi}_i(t) = -\nabla_x H_i\big(\psi_i(t), x^*(t), u_i^*(t), u_{\neg i}^*(t), t\big) \tag{4.4a}
\]
\[
\psi_i(T) = 0, \tag{4.4b}
\]

where (4.4b) follows from $h_i(x(T), T) = 0$ due to the Lagrangian structure of the cost function
(4.2), and the control equation

\[
u_i^*(t) = \arg\min_{u_i(t)} H_i\big(\psi_i(t), x^*(t), u_i(t), u_{\neg i}^*(t), t\big) \tag{4.5}
\]

must be fulfilled. Since we consider no constraints on the control variables $u_i(t)$, the control
equation (4.5) results in the Hamiltonian gradient condition

\[
0 = \nabla_{u_i} H_i\big(\psi_i(t), x^*(t), u_i^*(t), u_{\neg i}^*(t), t\big). \tag{4.6}
\]

With the Hamiltonian function of player $i$ being given by

\[
H_i = \theta_i^\top \phi_i\big(x(t), u_i(t), u_{\neg i}(t), t\big) + \psi_i^\top(t) f\big(x(t), u_i(t), u_{\neg i}(t), t\big) \tag{4.7}
\]

as a result of the cost function structure (4.2), the following definition is introduced.


25 Both fixed-element (e.g. [MTFP16]) and normalization-constraint parameter sets (e.g. [ARARU* 11]) are popular
in the related literature of inverse optimal control.


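Definition 4.4 (Residuals)
The functions

\[
r_C(\theta_i, \psi_i, t) = \big\| \nabla_{u_i} H_i(\psi_i, \theta_i, t) \big\|^2 \,\Big|_{\substack{u(t) = u^*(t) \\ x(t) = x^*(t)}} \tag{4.8}
\]

and

\[
r_\psi(\theta_i, \psi_i, t) = \big\| \dot{\psi}_i + \nabla_x H_i(\psi_i, \theta_i, t) \big\|^2 \,\Big|_{\substack{u(t) = u^*(t) \\ x(t) = x^*(t)}}, \tag{4.9}
\]

where $\| \cdot \|$ denotes the Euclidean norm, are called residuals of the control equation and
the costate equation, respectively.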


The residuals of Definition 4.4 result from the insertion of the Hamiltonian (4.7) in (4.4a) and
(4.6) and the subsequent insertion of the known optimal trajectories $x^*(t)$ and $u_i^*(t)$, which
result in a dependence on the costate functions $\psi_i$ and the parameters $\theta_i$ only. Note that
$r_C(\theta_i, \psi_i)$ and $r_\psi(\theta_i, \psi_i)$ are both equal to zero for $\theta_i = \theta_i^*$ and
$\psi_i(t) = \psi_i^*(t)$. Therefore, in light of this formulation, the idea of the proposed residual-based
method consists of the computation of $\hat{\theta}_i \in \Theta_i$ and costate functions
$\hat{\psi}_i : [0, T] \to \mathbb{R}^n$ for each player $i \in \mathcal{P}$ which solve the optimization problem

\[
\begin{aligned}
\min_{\psi_i, \theta_i} \quad & \int_0^T r_C(\theta_i, \psi_i) + \rho \, r_\psi(\theta_i, \psi_i) \, dt \\
\text{s.t.} \quad & \theta_i \in \Theta_i,
\end{aligned}
\tag{4.10}
\]

where $\rho > 0$ is a specifiable weighting factor. The intuition behind (4.10) is the following:
$\psi_i(t)$ and parameters $\theta_i$ are sought such that the costate condition (4.4a) and Hamiltonian
gradient condition (4.6) hold for all $t \in [0,T]$.^26 Under Assumption 4.2, $\hat{\theta}_i = \theta_i^*$
will be a (possibly non-unique) solution to (4.10).
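
To make the objective of (4.10) concrete, the following sketch evaluates the integrated residuals for a hypothetical single-player problem that is not taken from the thesis: dynamics x' = u, basis functions [u^2, x^2], true parameters [1, q] and horizon T. For this linear-quadratic case the optimal state, control and costate are available in closed form (an assumption of this illustration, obtained from the minimum principle), so the residuals can be checked directly.

```python
import math

Q_TRUE, X0, T = 4.0, 1.0, 1.0          # hypothetical example data
W = math.sqrt(Q_TRUE)
C = X0 / math.cosh(W * T)

def x_star(t):
    return C * math.cosh(W * (T - t))

def u_star(t):
    return -C * W * math.sinh(W * (T - t))

def psi_star(t):
    return 2.0 * C * W * math.sinh(W * (T - t))      # costate, psi(T) = 0

def psi_dot_star(t):
    return -2.0 * C * Q_TRUE * math.cosh(W * (T - t))

def residual_integral(theta, psi, psi_dot, rho=1.0, steps=2000):
    """Trapezoidal approximation of int_0^T r_C + rho * r_psi dt along the observed trajectory."""
    def integrand(t):
        # Hamiltonian H = theta_1 u^2 + theta_2 x^2 + psi u for f(x, u) = u
        grad_u_H = 2.0 * theta[0] * u_star(t) + psi(t)
        grad_x_H = 2.0 * theta[1] * x_star(t)
        return grad_u_H ** 2 + rho * (psi_dot(t) + grad_x_H) ** 2
    h = T / steps
    vals = [integrand(k * h) for k in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

r_true = residual_integral([1.0, Q_TRUE], psi_star, psi_dot_star)
r_wrong = residual_integral([1.0, 3.0], psi_star, psi_dot_star)
```

With the true parameters and costate both residuals vanish up to rounding, whereas a wrong state weight leaves a strictly positive costate residual. The method searches over both the parameters and the costate function, which this sketch does not do; it only evaluates the objective for given candidates.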


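The solution of (4.10) is based on its reformulation as a quadratic program. For that purpose,
it shall first be rewritten as an LQ dynamic optimization problem. Let us define the matrix

\[
L := \begin{bmatrix} I_{M_i} & 0_{M_i \times n} \end{bmatrix} \in \mathbb{R}^{M_i \times (M_i + n)} \tag{4.11}
\]

where $I_{M_i}$ denotes a square identity matrix with dimensions $M_i \times M_i$. Similarly,
$0_{M_i \times n}$ denotes a zero matrix with dimensions $M_i \times n$. Furthermore, we define the
matrices $R := I_n$, $B := \begin{bmatrix} 0_{n \times M_i} & I_n \end{bmatrix}^\top$ and the time-variant matrices

\[
N_i(t) := \begin{bmatrix} \rho \nabla_x \phi_i(t) & \rho \nabla_x f(t) \end{bmatrix}^\top \tag{4.12}
\]

\[
Q_i(t) :=
\begin{bmatrix} \sqrt{\rho}\, \nabla_x \phi_i(t) & \sqrt{\rho}\, \nabla_x f(t) \\ \nabla_{u_i} \phi_i(t) & \nabla_{u_i} f(t) \end{bmatrix}^\top
\begin{bmatrix} \sqrt{\rho}\, \nabla_x \phi_i(t) & \sqrt{\rho}\, \nabla_x f(t) \\ \nabla_{u_i} \phi_i(t) & \nabla_{u_i} f(t) \end{bmatrix}
\tag{4.13}
\]

where we use the shorthand $\nabla_x f(t) \in \mathbb{R}^{n \times n}$ and $\nabla_{u_i} f(t) \in \mathbb{R}^{m_i \times n}$
to denote the matrices of partial derivatives of $f$ with respect to $x(t)$ and $u_i(t)$, respectively^27,
and evaluated with $\theta_i$, $x(t)$, and $u_i(t)$ for $i \in \mathcal{P}$. Similarly, we use
$\nabla_x \phi_i(t) \in \mathbb{R}^{n \times M_i}$ and $\nabla_{u_i} \phi_i(t) \in \mathbb{R}^{m_i \times M_i}$ to
denote the matrices of partial derivatives of $\phi_i$ evaluated with $\theta_i$, $x(t)$, and $u_i(t)$ for
$i \in \mathcal{P}$. The following lemma rewrites the problem (4.10) as an LQ dynamic optimization problem.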


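Lemma 4.1


Consider any player $i \in \mathcal{P}$. The optimization problem (4.10) over the costates $\psi_i$ and
parameters $\theta_i$ is equivalent to the LQ dynamic optimization problem

\[
\begin{aligned}
\min_{z_i, v_i} \quad & \int_0^T z_i^\top(t) Q_i(t) z_i(t) + v_i^\top(t) \rho R v_i(t) + 2 z_i^\top(t) N_i(t) v_i(t) \, dt \\
\text{s.t.} \quad & \dot{z}_i(t) = B v_i(t), \quad t \in [0,T] \\
& L z_i(t) \in \Theta_i, \quad t \in [0,T]
\end{aligned}
\tag{4.14}
\]

over the functions $z_i : [0,T] \to \mathbb{R}^{M_i + n}$ and $v_i : [0,T] \to \mathbb{R}^n$ with the variable
substitutions

\[
z_i(t) = \begin{bmatrix} \theta_i \\ \psi_i(t) \end{bmatrix} \quad \text{and} \quad v_i(t) = \dot{\psi}_i(t). \tag{4.15}
\]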


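26 If $N = 1$ and $\rho = 1$, this method reduces to the single-player approach presented in [JAB13].
27 The partial derivatives $\nabla_x f(t)$ are defined here as the transposed Jacobian of $f$, i.e.
$\nabla_x f(t) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^\top$.

Proof:
We note that the integrand of the objective functional of (4.10) may be rewritten as

\[
\begin{aligned}
& \big\| \nabla_{u_i} H_i(t, \psi_i(t), \theta_i) \big\|^2 + \rho \big\| \dot{\psi}_i(t) + \nabla_x H_i(t, \psi_i(t), \theta_i) \big\|^2 \\
&\quad = \left\| \begin{bmatrix} \sqrt{\rho}\, \dot{\psi}_i(t) + \sqrt{\rho}\, \nabla_x H_i(t, \psi_i(t), \theta_i) \\ \nabla_{u_i} H_i(t, \psi_i(t), \theta_i) \end{bmatrix} \right\|^2 \\
&\quad = \left\| \begin{bmatrix} \sqrt{\rho}\, \dot{\psi}_i(t) + \sqrt{\rho}\, \nabla_x \phi_i(t) \theta_i + \sqrt{\rho}\, \nabla_x f(t) \psi_i(t) \\ \nabla_{u_i} \phi_i(t) \theta_i + \nabla_{u_i} f(t) \psi_i(t) \end{bmatrix} \right\|^2 \\
&\quad = \left\| \begin{bmatrix} \sqrt{\rho}\, \nabla_x \phi_i(t) & \sqrt{\rho}\, \nabla_x f(t) & \sqrt{\rho}\, I_n \\ \nabla_{u_i} \phi_i(t) & \nabla_{u_i} f(t) & 0_{m_i \times n} \end{bmatrix} \begin{bmatrix} \theta_i \\ \psi_i(t) \\ \dot{\psi}_i(t) \end{bmatrix} \right\|^2 \\
&\quad = z_i^\top(t) Q_i(t) z_i(t) + v_i^\top(t) \rho R v_i(t) + 2 z_i^\top(t) N_i(t) v_i(t)
\end{aligned}
\]

where the second equality holds by recalling the definition of the player Hamiltonian (4.7),
and the third and fourth equalities are obtained via matrix algebra by recalling the definitions
of $Q_i(t)$, $R$, and $N_i(t)$ together with the variable substitutions (4.15). We also note that the
constraint $\theta_i \in \Theta_i$ may be equivalently written as

\[
L z_i(t) = \theta_i \in \Theta_i
\]

and the (implicit) constraint in (4.10) that $\theta_i$ is time-invariant is equivalent to the constraint

\[
\dot{z}_i(t) = \begin{bmatrix} 0_{n \times M_i} & I_n \end{bmatrix}^\top \dot{\psi}_i(t) = B v_i(t).
\]

Minimization of the functional

\[
\int_0^T z_i^\top(t) Q_i(t) z_i(t) + v_i^\top(t) \rho R v_i(t) + 2 z_i^\top(t) N_i(t) v_i(t) \, dt
\]

over $z_i : [0,T] \to \mathbb{R}^{M_i + n}$ and $v_i : [0,T] \to \mathbb{R}^n$ subject to the constraints
$\dot{z}_i(t) = B v_i(t)$ and $L z_i(t) \in \Theta_i$ for all $t \in [0,T]$ is therefore equivalent to the
minimization of the objective functional of (4.10) over $\psi_i(t)$ and $\theta_i$ subject to
$\theta_i \in \Theta_i$ with the substitutions

\[
z_i(t) = \begin{bmatrix} \theta_i \\ \psi_i(t) \end{bmatrix} \quad \text{and} \quad v_i(t) = \dot{\psi}_i(t).
\]

The lemma result follows and the proof is complete. □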


Lemma 4.1 establishes that (4.10) at the core of the proposed method can be rewritten as (4.14)
with linear dynamic constraints, quadratic objective functional, and (partial) constraints on
the function $L z_i(t)$. The following lemma shows that (4.14) can be solved as an LQ optimal
control problem with an unknown initial state $z_i(0)$ resulting in a quadratic program.


Lemma 4.2 (Quadratic Program Formulation)

Consider any player $i \in \mathcal{P}$ and suppose that $\rho > 0$ is selected such that the matrix
$Q_i(t) - N_i(t) \rho^{-1} R^{-1} N_i^\top(t)$ is positive semidefinite for all $t \in [0,T]$. A pair of
functions $\hat{z}_i : [0,T] \to \mathbb{R}^{M_i + n}$ and $\hat{v}_i : [0,T] \to \mathbb{R}^n$ solves the dynamic
optimization problem (4.14) if and only if the initial value $\hat{z}_i(0) = \hat{\alpha}_i \in \mathbb{R}^{M_i + n}$
solves the quadratic program

\[
\begin{aligned}
\min_{\alpha_i} \quad & \alpha_i^\top P_i(0) \alpha_i \\
\text{s.t.} \quad & L \alpha_i \in \Theta_i
\end{aligned}
\tag{4.16}
\]

and the pair of functions satisfy the differential equation

\[
\dot{\hat{z}}_i(t) = B \hat{v}_i(t) = B K_i(t) \hat{z}_i(t) \tag{4.17}
\]

for all $t \in [0,T]$ where $K_i(t) := -\rho^{-1} \big[ B^\top P_i(t) + N_i^\top(t) \big]$ and
$P_i : [0,T] \to \mathbb{R}^{(M_i + n) \times (M_i + n)}$ is the unique symmetric positive semidefinite
solution to the Riccati differential equation

\[
0 = \dot{P}_i(t) - \rho^{-1} \big( P_i(t) B + N_i(t) \big) \big( B^\top P_i(t) + N_i^\top(t) \big) + Q_i(t) \tag{4.18}
\]

for $t \in [0,T]$ with terminal boundary condition $P_i(T) = 0$.


Proof:

Consider any player $i \in \mathcal{P}$. We first note that given a function $v_i : [0,T] \to \mathbb{R}^n$
together with an initial value $z_i(0) = \alpha_i \in \mathbb{R}^{M_i + n}$ with $L \alpha_i \in \Theta_i$, we may
solve the differential equation $\dot{z}_i(t) = B v_i(t)$ for the unique function
$z_i : [0,T] \to \mathbb{R}^{M_i + n}$. The constraints in the dynamic optimization problem (4.14) from
Lemma 4.1 therefore imply that the optimization in (4.14) may be rewritten as an optimization over
$z_i(0)$ and $v_i$ only. Namely, (4.14) is equivalent to the unknown initial state optimal control problem

\[
\begin{aligned}
\min_{\alpha_i} \min_{v_i} \quad & \int_0^T z_i^\top(t) Q_i(t) z_i(t) + v_i^\top(t) \rho R v_i(t) + 2 z_i^\top(t) N_i(t) v_i(t) \, dt \\
\text{s.t.} \quad & \dot{z}_i(t) = B v_i(t), \quad t \in [0,T] \\
& z_i(0) = \alpha_i \\
& L \alpha_i \in \Theta_i.
\end{aligned}
\tag{4.19}
\]

For any $\alpha_i \in \mathbb{R}^{M_i + n}$, the inner optimization problem over the function $v_i$ in (4.19)
is a standard LQ optimal control problem with cross-product terms.
Under the positive definiteness of $R = I_n$ as well as $\rho > 0$ and the positive semidefiniteness
of the expression $Q_i(t) - N_i(t) \rho^{-1} R^{-1} N_i^\top(t)$, Section 3.4 of [AM89] gives that for any
$z_i(0) = \alpha_i \in \mathbb{R}^{M_i + n}$, the unique function solving the inner optimization problem
over $v_i$ in (4.19) is

\[
\hat{v}_i(t) = K_i(t) z_i(t) \tag{4.20}
\]

for all $t \in [0,T]$ where $K_i(t) = -\rho^{-1} \big[ B^\top P_i(t) + N_i^\top(t) \big]$ and
$P_i : [0,T] \to \mathbb{R}^{(M_i + n) \times (M_i + n)}$ is the unique symmetric positive semidefinite
solution to the Riccati differential equation (4.18) with $P_i(T) = 0$ (see also [Kuč73, Kal64]).
Section 3.4 of [AM89] also gives that the minimum value of the inner optimization problem over $v_i$
in (4.19) is

\[
\alpha_i^\top P_i(0) \alpha_i \tag{4.21}
\]

for any initial state $z_i(0) = \alpha_i$. The function $\hat{z}_i$ solving the inner optimization of (4.19)
satisfies $\dot{\hat{z}}_i(t) = B K_i(t) \hat{z}_i(t)$ for any initial state $\hat{\alpha}_i$. Consequently, the
unknown initial state optimal control problem (4.19) simplifies to the quadratic program (4.16). It
follows that the pair of functions $(\hat{z}_i, \hat{v}_i)$ solves (4.14) if the functions satisfy the
differential equation (4.17) and $\hat{z}_i(0) = \hat{\alpha}_i$ solves (4.16).


In the following, the “only if” part of the lemma assertion is proved, i.e. that if the pair of
functions $(\hat{z}_i, \hat{v}_i)$ solves (4.14), then they satisfy the differential equation (4.17) and
$\hat{z}_i(0) = \hat{\alpha}_i$ solves the quadratic program (4.16). We first note that the function
$\hat{v}_i$ solving the inner optimization problem over $v_i$ in (4.19) is unique and given by (4.20) for
any given $\alpha_i \in \mathbb{R}^{M_i + n}$. Thus, if the pair of functions $(\hat{z}_i, \hat{v}_i)$ solves
(4.14), then it must satisfy the differential equation (4.17). Since the unique form of $\hat{v}_i$ implies
that (4.19) reduces to the quadratic program (4.16), $\hat{z}_i(0) = \hat{\alpha}_i$ solves (4.16) if the pair
of functions $(\hat{z}_i, \hat{v}_i)$ solves (4.14). The lemma result follows and the proof is complete. □


Lemma 4.2 allows us to solve the quadratic program (4.16) for the initial values
$\hat{z}_i(0) = \hat{\alpha}_i$ instead of solving (4.14) for the functions $\hat{z}_i$ over the entire
interval $t \in [0,T]$. Recalling Lemma 4.1 and the definition of the vectors $z_i(0)$, we note that the
initial values $\hat{z}_i(0) = \hat{\alpha}_i$ correspond to the vector

\[
\hat{\alpha}_i = \begin{bmatrix} \hat{\theta}_i \\ \hat{\psi}_i(0) \end{bmatrix} \tag{4.22}
\]

where $\hat{\theta}_i$ and $\hat{\psi}_i$ are solutions to the residual-based method (4.10). Together,
Lemmas 4.1 and 4.2 therefore allow us to sidestep the difficult problem of directly solving and
analyzing the original optimization problem (4.10) and instead solve the quadratic program (4.16)
for the parameters $\hat{\theta}_i = L \hat{\alpha}_i$.

Remark 4.1:
The choice of $\rho = 1$ is always sufficient to ensure that the expression
$Q_i(t) - N_i(t) \rho^{-1} R^{-1} N_i^\top(t)$ is positive semidefinite for all $t \in [0,T]$ since

\[
\begin{aligned}
Q_i(t) - N_i(t) R^{-1} N_i^\top(t)
&= Q_i(t) - N_i(t) N_i^\top(t) \\
&= \begin{bmatrix} \nabla_{u_i} \phi_i^\top(t) \nabla_{u_i} \phi_i(t) & \nabla_{u_i} \phi_i^\top(t) \nabla_{u_i} f(t) \\ \nabla_{u_i} f^\top(t) \nabla_{u_i} \phi_i(t) & \nabla_{u_i} f^\top(t) \nabla_{u_i} f(t) \end{bmatrix} \\
&= \begin{bmatrix} \nabla_{u_i} \phi_i(t) & \nabla_{u_i} f(t) \end{bmatrix}^\top \begin{bmatrix} \nabla_{u_i} \phi_i(t) & \nabla_{u_i} f(t) \end{bmatrix}
\end{aligned}
\]

with the first equality holding due to the definition of $R$, and the second and third equalities
following by substituting the definitions of $Q_i(t)$ and $N_i(t)$. Other values of $\rho$ may result in
$Q_i(t) - N_i(t) \rho^{-1} R^{-1} N_i^\top(t)$ not being positive semidefinite and thus lead to multiple
solutions of (4.18).

In the following, the results of Lemmas 4.1 and 4.2 shall be used to establish novel explicit
expressions for the parameters $\hat{\theta}_i$ that solve the inverse differential game problem.
Furthermore, sufficient conditions shall be presented under which the parameters $\hat{\theta}_i$ are
guaranteed to be unique and identical to the original parameters $\theta_i^*$ up to a positive
multiplicative factor.
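
The reduction of (4.10) to the Riccati differential equation (4.18) and the quadratic program (4.16) can be tried out on a small case. The sketch below is an illustration under stated assumptions, not the implementation used in the thesis: it considers a hypothetical single-player problem with dynamics x' = u, basis functions [u^2, x^2], true parameters [1, q] and closed-form optimal trajectories, uses rho = 1, integrates (4.18) backward with a classical Runge-Kutta scheme, and solves the quadratic program under the fixed-element constraint by a 2x2 linear solve. All names are chosen for this example.

```python
import math

Q_TRUE, X0, T, RHO = 4.0, 1.0, 1.0, 1.0     # hypothetical example data
W = math.sqrt(Q_TRUE)
C = X0 / math.cosh(W * T)

def x_star(t):
    return C * math.cosh(W * (T - t))

def u_star(t):
    return -C * W * math.sinh(W * (T - t))

def riccati_rhs(P, t):
    """-dP/dt = Q(t) - (P B + N(t)) (P B + N(t))^T / rho for z = [theta_1, theta_2, psi]."""
    g = [2.0 * u_star(t), 0.0, 1.0]          # [grad_u phi, grad_u f]
    h = [0.0, 2.0 * x_star(t), 0.0]          # [grad_x phi, grad_x f]
    w = [P[i][2] + RHO * h[i] for i in range(3)]   # P B + N with B = [0, 0, 1]^T
    return [[g[i] * g[j] + RHO * h[i] * h[j] - w[i] * w[j] / RHO
             for j in range(3)] for i in range(3)]

def axpy(P, K, c):
    """Return P + c * K for 3x3 nested lists."""
    return [[P[i][j] + c * K[i][j] for j in range(3)] for i in range(3)]

def solve_riccati(steps=2000):
    """Integrate (4.18) backward from P(T) = 0 to P(0) with classical RK4."""
    P = [[0.0] * 3 for _ in range(3)]
    h = T / steps
    for k in range(steps):
        t = T - k * h                         # current time, moving backward
        k1 = riccati_rhs(P, t)
        k2 = riccati_rhs(axpy(P, k1, 0.5 * h), t - 0.5 * h)
        k3 = riccati_rhs(axpy(P, k2, 0.5 * h), t - 0.5 * h)
        k4 = riccati_rhs(axpy(P, k3, h), t - h)
        for i in range(3):
            for j in range(3):
                P[i][j] += h / 6.0 * (k1[i][j] + 2.0 * k2[i][j]
                                      + 2.0 * k3[i][j] + k4[i][j])
    return P

P0 = solve_riccati()
# With theta_(1) = 1 fixed, the minimizer solves P_bar * alpha_bar = -p_bar (Cramer's rule)
a, b, c, d = P0[1][1], P0[1][2], P0[2][1], P0[2][2]
p1, p2 = P0[1][0], P0[2][0]
det = a * d - b * c
theta2_hat = (-p1 * d + b * p2) / det        # estimate of the state weight
psi0_hat = (-a * p2 + c * p1) / det          # estimate of the initial costate
```

In this full-rank case the linear solve recovers the state weight q and the initial costate of the closed-form solution. The rank-deficient case requires the pseudoinverse and the SVD, which this pure-Python sketch does not cover.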


4.3.2 Sufficient Conditions for the Uniqueness of the Solution 


To present the main result on the solution of the residual-based method (4.10), consider the
matrix $P_i(0)$ of the optimization problem (4.16) and define

\[
\bar{P}_i := \begin{bmatrix}
P_{i,(2,2)}(0) & \cdots & P_{i,(2,M_i+n)}(0) \\
P_{i,(3,2)}(0) & \cdots & P_{i,(3,M_i+n)}(0) \\
\vdots & \ddots & \vdots \\
P_{i,(M_i+n,2)}(0) & \cdots & P_{i,(M_i+n,M_i+n)}(0)
\end{bmatrix}
\tag{4.23}
\]

as the principal submatrix of $P_i(0)$ formed by deleting the first row and column of $P_i(0)$,
and

\[
\bar{p}_i := \begin{bmatrix} P_{i,(2,1)}(0) & P_{i,(3,1)}(0) & \cdots & P_{i,(M_i+n,1)}(0) \end{bmatrix}^\top \tag{4.24}
\]

which denotes the first column of $P_i(0)$ with deleted first element. Furthermore, let

\[
\bar{P}_i = U_i \Sigma_i U_i^\top
\]

be the singular value decomposition (SVD) of $\bar{P}_i$ where
$\Sigma_i \in \mathbb{R}^{(M_i+n-1) \times (M_i+n-1)}$ is a diagonal matrix, and

\[
U_i = \begin{bmatrix} U_i^{11} & U_i^{12} \\ U_i^{21} & U_i^{22} \end{bmatrix} \in \mathbb{R}^{(M_i+n-1) \times (M_i+n-1)} \tag{4.25}
\]

is a block matrix with submatrices $U_i^{11} \in \mathbb{R}^{(M_i-1) \times r_i^P}$,
$U_i^{12} \in \mathbb{R}^{(M_i-1) \times (M_i+n-1-r_i^P)}$, $U_i^{21} \in \mathbb{R}^{n \times r_i^P}$
and $U_i^{22} \in \mathbb{R}^{n \times (M_i+n-1-r_i^P)}$. Finally, $\bar{P}_i^+$ and $r_i^P$ represent the
pseudoinverse and rank of the submatrix $\bar{P}_i$, respectively. To present the main result, we recall
the introduced parameter set

\[
\Theta_i = \{ \theta_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1 \} \tag{4.26}
\]

so as to exclude the trivial solution $\hat{\theta}_i = 0$ and to exclude non-uniqueness due to scaling.
As discussed in Section 4.2, there is no loss of generality with this parameter set since the
ordering and scaling of the basis functions and cost function parameters is arbitrary.


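Theorem 4.1 (General Solution of the Residual-Based Method)

Consider any player $i \in \mathcal{P}$, and let $\Theta_i = \{ \theta_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1 \}$.
All of the parameter vectors $\hat{\theta}_i \in \Theta_i$ corresponding to all solutions
$(\hat{\psi}_i, \hat{\theta}_i)$ of the proposed method (4.10) are of the form

\[
\hat{\theta}_i = L \hat{\alpha}_i \tag{4.27}
\]

where $\hat{\alpha}_i = \begin{bmatrix} 1 & \bar{\alpha}_i^\top \end{bmatrix}^\top \in \mathbb{R}^{M_i+n}$ are
(potentially non-unique) solutions to the quadratic program (4.16) with
$\bar{\alpha}_i \in \mathbb{R}^{M_i+n-1}$ given by

\[
\bar{\alpha}_i = -\bar{P}_i^+ \bar{p}_i + U_i \begin{bmatrix} 0_{r_i^P} \\ b \end{bmatrix} \tag{4.28}
\]

where $0_{r_i^P} \in \mathbb{R}^{r_i^P}$ is a zero vector and $b \in \mathbb{R}^{M_i+n-1-r_i^P}$ is arbitrary.
Furthermore, if either $U_i^{12} = 0$ or $\bar{P}_i$ has full rank, i.e. $r_i^P = M_i + n - 1$, then all
solutions $(\hat{\psi}_i, \hat{\theta}_i)$ to the proposed method (4.10) correspond to the single unique
parameter vector $\hat{\theta}_i \in \Theta_i$ given by

\[
\hat{\theta}_i = L \begin{bmatrix} 1 \\ -\bar{P}_i^+ \bar{p}_i \end{bmatrix}. \tag{4.29}
\]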


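Proof:

Lemmas 4.1 and 4.2 together imply that all solutions to the original optimization problem
(4.10) of the proposed residual-based method have parameter vectors given by
$\hat{\theta}_i = L \hat{\alpha}_i$ where $\hat{\alpha}_i$ is a solution to the quadratic program (4.16).
We thus proceed by analyzing (4.16). For any $\alpha_i \in \mathbb{R}^{M_i+n}$ with $L \alpha_i \in \Theta_i$
where $\Theta_i = \{ \theta_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1 \}$, we have that
$\alpha_i = \begin{bmatrix} 1 & \bar{\alpha}_i^\top \end{bmatrix}^\top$ where
$\bar{\alpha}_i \in \mathbb{R}^{M_i+n-1}$ and so

\[
\alpha_i^\top P_i(0) \alpha_i
= \begin{bmatrix} 1 & \bar{\alpha}_i^\top \end{bmatrix} P_i(0) \begin{bmatrix} 1 \\ \bar{\alpha}_i \end{bmatrix}
= P_{i,(1,1)}(0) + \bar{\alpha}_i^\top \bar{P}_i \bar{\alpha}_i + 2 \bar{\alpha}_i^\top \bar{p}_i
\]

where $P_{i,(1,1)}(0)$ is the first element of $P_i(0)$. All solutions $\hat{\alpha}_i$ of the constrained
quadratic program (4.16) with $\Theta_i = \{ \theta_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1 \}$ are
therefore of the form $\hat{\alpha}_i = \begin{bmatrix} 1 & \bar{\alpha}_i^\top \end{bmatrix}^\top$ where
$\bar{\alpha}_i \in \mathbb{R}^{M_i+n-1}$ are solutions to the unconstrained quadratic program

\[
\min_{\bar{\alpha}_i} \; \bar{\alpha}_i^\top \bar{P}_i \bar{\alpha}_i + 2 \bar{\alpha}_i^\top \bar{p}_i.
\]

We note that $P_i(0)$ is symmetric positive semidefinite, which guarantees the existence of a
solution of (4.16). Furthermore, this leads to $\bar{P}_i$ also being symmetric positive semidefinite.
With these conditions fulfilled, [Gal11, Proposition 15.2] gives that the equivalent unconstrained
quadratic program is solved by any $\bar{\alpha}_i$ satisfying

\[
\bar{\alpha}_i = -\bar{P}_i^+ \bar{p}_i + U_i \begin{bmatrix} 0_{r_i^P} \\ b \end{bmatrix}
\]

for any $b \in \mathbb{R}^{M_i+n-1-r_i^P}$. The first theorem assertion (4.27) follows.


Now, to prove the second theorem assertion we note that if $U_i^{12} = 0$, then

\[
\bar{\alpha}_i = -\bar{P}_i^+ \bar{p}_i + \begin{bmatrix} 0 \\ U_i^{22} b \end{bmatrix}
\]

for any $b \in \mathbb{R}^{M_i+n-1-r_i^P}$ where $U_i^{22} b \in \mathbb{R}^n$. Clearly, if
$r_i^P = M_i + n - 1$, then we also have that

\[
\bar{\alpha}_i = -\bar{P}_i^+ \bar{p}_i.
\]

Thus, if either $U_i^{12} = 0$ or $r_i^P = M_i + n - 1$, then the first $M_i - 1$ components of
$\bar{\alpha}_i$ are invariant with respect to the free vector $b \in \mathbb{R}^{M_i+n-1-r_i^P}$, and so
all solutions $\hat{\alpha}_i = \begin{bmatrix} 1 & \bar{\alpha}_i^\top \end{bmatrix}^\top$ of the
constrained quadratic program (4.16) satisfy

\[
L \hat{\alpha}_i = L \begin{bmatrix} 1 \\ -\bar{P}_i^+ \bar{p}_i \end{bmatrix}
\]

due to the definition of $L$ (cf. (4.11)). The second theorem assertion follows since
$\hat{\theta}_i = L \hat{\alpha}_i$, which completes the proof. □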


Theorem 4.1 establishes that the conditions $U_i^{12} = 0$ and $r_i^P = M_i + n - 1$ are each
sufficient for ensuring the uniqueness of the player cost-functional parameters $\hat{\theta}_i$
computed with the proposed method (4.10). These conditions will not hold when the inverse differential
game problem is ill-posed, for example on short time horizons $T$, due to degenerate system
dynamics, or when the trajectories are uninformative (e.g. when the trajectories $x^*(t)$
and $(u_1^*(t), \dots, u_N^*(t))$ correspond to an equilibrium of the dynamics in the sense that
$\dot{x}(t) = 0$ for all $t \in [0,T]$). The conditions $U_i^{12} = 0$ and $r_i^P = M_i + n - 1$ may be
interpreted as analogous to the persistence of excitation conditions known from parameter
estimation and adaptive control.


The following corollary establishes that, if the ill-posedness of the inverse differential game
problem is only due to an unknown scaling factor, then $U_i^{12} = 0$ and $r_i^P = M_i + n - 1$ become
sufficient conditions for ensuring that the residual-based method (4.10) yields unique player
cost-functional parameters that only differ from the true player cost-functional parameters
$\theta_i^*$ by an unknown scaling factor $c_i > 0$ when Assumption 4.2 holds.


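Corollary 4.1 (Uniqueness up to a Scaling Factor)

Suppose that Assumption 4.2 holds. Consider any player $i \in \mathcal{P}$, and let
$\Theta_i = \{ \theta_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1 \}$. If either $U_i^{12} = 0$ or
$r_i^P = M_i + n - 1$, and if there exists a $c_i > 0$ such that $c_i \theta_i^* \in \Theta_i$, then

\[
\hat{\theta}_i = L \begin{bmatrix} 1 \\ -\bar{P}_i^+ \bar{p}_i \end{bmatrix} = c_i \theta_i^* \tag{4.30}
\]

is the unique parameter vector corresponding to all optimal solutions $(\hat{\psi}_i, \hat{\theta}_i)$ of the
residual-based method (4.10).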


Proof:

The necessary conditions for open-loop Nash equilibria of Theorem 3.1, i.e. (4.4) and (4.6), imply
that $(\psi_i, c_i \theta_i^*)$ (with $\psi_i$ solving (4.4) under $\psi_i(T) = 0$ and
$\theta_i = c_i \theta_i^*$) is always a solution to the proposed method (4.10) under Assumption 4.2.
Since the conditions of the corollary give that $c_i \theta_i^*$ is in $\Theta_i$, and since the second
assertion of Theorem 4.1 implies the uniqueness of the parameter vector $\hat{\theta}_i \in \Theta_i$
corresponding to all optimal solutions of the residual-based method (4.10) if either $U_i^{12} = 0$ or
$r_i^P = M_i + n - 1$, we must have that $\hat{\theta}_i = c_i \theta_i^*$ when either $U_i^{12} = 0$ or
$r_i^P = M_i + n - 1$ holds. The corollary assertion follows. □

In the following, the implications of each condition of Theorem 4.1 for the originally posed
residual-based method (4.10) are analyzed.


Full-Rank Condition 


In Corollary 4.1 and Theorem 4.1, if $r_i^P = M_i + n - 1$ holds then both the player
cost-functional parameters $\hat{\theta}_i$ and costate functions $\hat{\psi}_i$ solving (4.10) are unique.
To see that a unique pair $(\hat{\psi}_i, \hat{\theta}_i)$ solves (4.10) when $r_i^P = M_i + n - 1$, we note
that the first assertion of Theorem 4.1, specifically (4.28), implies that the vectors
$\hat{\alpha}_i = \begin{bmatrix} 1 & \bar{\alpha}_i^\top \end{bmatrix}^\top$ are unique solutions to the
quadratic program (4.16) if $r_i^P = M_i + n - 1$ because the free vector $b$ will be
zero-dimensional. Now, since Lemmas 4.1 and 4.2 imply that the vectors $\hat{\alpha}_i = \hat{z}_i(0)$
correspond to $\begin{bmatrix} \hat{\theta}_i^\top & \hat{\psi}_i^\top(0) \end{bmatrix}^\top$, and since
Lemma 4.2 implies a unique function $\hat{\psi}_i$ for each initial condition $\hat{\psi}_i(0)$, we have that
the pair $(\hat{\psi}_i, \hat{\theta}_i)$ is indeed the unique solution to (4.10) when
$r_i^P = M_i + n - 1$.


SVD Matrix Condition 


The condition $U_i^{12} = 0$ can hold when $r_i^P < M_i + n - 1$. If $U_i^{12} = 0$ but
$r_i^P < M_i + n - 1$, then the second assertion of Theorem 4.1 implies that all pairs
$(\hat{\psi}_i, \hat{\theta}_i)$ solving (4.10) will share the unique parameter vector $\hat{\theta}_i$ given
by (4.29) but may not share a common costate function $\hat{\psi}_i(t)$. The condition $U_i^{12} = 0$
prohibits the elements of $\bar{\alpha}_i$ corresponding to $\hat{\theta}_i$ (but not $\hat{\psi}_i(0)$)
from depending on the free vector $b$ in (4.28).


4.3.3 Algorithm and Example 
In light of Theorem 4.1 and the role of the conditions $U_i^{12} = 0$ and $r_i^P = M_i + n - 1$, the
residual-based method (4.10) can be implemented for each player $i \in \mathcal{P}$ with the following
algorithm:


Algorithm 1 Residual-based method for player i in an inverse OL differential game.

Input: State and control trajectories $x^*(t)$ and $(u_1^*(t), \dots, u_N^*(t))$, dynamics $f$, basis
functions $\phi_i$ and parameter constraint set $\Theta_i = \{ \theta_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1 \}$.
Output: Computed player $i$ cost-functional parameters $\theta_i$.
 1: Compute $N_i(t)$ and $Q_i(t)$ from (4.12) and (4.13), $t \in [0,T]$.
 2: Solve Riccati equation (4.18) with $P_i(T) = 0$ for $P_i(0)$.
 3: Compute submatrix $\bar{P}_i$ from (4.23) and vector $\bar{p}_i$ from (4.24).
 4: Compute rank $r_i^P$ of $\bar{P}_i$.
 5: Compute pseudoinverse $\bar{P}_i^+$ of $\bar{P}_i$.
 6: if $r_i^P = M_i + n - 1$ then
 7:     return Unique $\theta_i = \hat{\theta}_i$ given by (4.29).
 8: else
 9:     Compute $U_i$ and $U_i^{12}$ in (4.25) through SVD of $\bar{P}_i$.
10:     if $U_i^{12} = 0$ then
11:         return Unique $\theta_i = \hat{\theta}_i$ given by (4.29).
12:     else
13:         return Any $\theta_i = \hat{\theta}_i$ from (4.27) with any $b \in \mathbb{R}^{M_i+n-1-r_i^P}$.
14:     end if
15: end if


Hence, the core of the proposed residual-based method with Algorithm 1 is the solution
of a Riccati differential equation (RDE), and thus we avoid the need to solve nested differential
game or optimal control problems. Furthermore, we are also free to compute the cost function
parameters of each player separately (rather than as part of the same optimization). Finally, the
presented method gives conditions under which the computed parameters are unique in the parameter
set $\Theta_i$. These conditions hold for $N$-player inverse differential games and are therefore
valid for the special case of (single-player) inverse optimal control as well.
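
Steps 3 to 15 of Algorithm 1 only involve linear algebra on the matrix obtained from the Riccati equation at the initial time. A minimal sketch with NumPy is given below, assuming that this matrix has already been computed; the function name `solve_residual_qp`, its return values and the tolerance are assumptions of this illustration, and the two test matrices are synthetic rather than taken from the thesis.

```python
import numpy as np

def solve_residual_qp(P0, M, tol=1e-9):
    """Sketch of steps 3 to 15 of Algorithm 1 for a given P_i(0) and M basis functions.

    Returns (theta_hat, unique, U12): the particular solution (4.29), a flag stating
    whether one of the two uniqueness conditions holds, and the block U12 whose rows
    show how the free vector b can change theta_hat[1:].
    """
    P0 = np.asarray(P0, dtype=float)
    P_bar = P0[1:, 1:]                       # principal submatrix, cf. (4.23)
    p_bar = P0[1:, 0]                        # first column without first entry, cf. (4.24)
    rank = int(np.linalg.matrix_rank(P_bar, tol))
    alpha_bar = -np.linalg.pinv(P_bar, tol) @ p_bar
    theta_hat = np.concatenate(([1.0], alpha_bar[:M - 1]))
    U, _, _ = np.linalg.svd(P_bar)           # columns beyond `rank` span the null space
    U12 = U[:M - 1, rank:]
    unique = rank == P_bar.shape[0] or bool(np.allclose(U12, 0.0, atol=1e-8))
    return theta_hat, unique, U12

# Synthetic case with M = 2, n = 2: the costate part is ambiguous, the parameters are not
w1, w2 = np.array([2.0, -1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0, 1.0])
theta_a, unique_a, _ = solve_residual_qp(np.outer(w1, w1) + np.outer(w2, w2), M=2)

# Synthetic case with M = 3, n = 1: the third parameter trades off against the costate
w3, w4 = np.array([5.0, -1.0, 0.0, 0.0]), np.array([-4.0, 0.0, 1.0, 1.0])
theta_b, unique_b, U12_b = solve_residual_qp(np.outer(w3, w3) + np.outer(w4, w4), M=3)
```

In the first case the submatrix is rank deficient but the null space only touches the costate entries, so the parameter vector is still unique. In the second case the null space couples the last parameter with the initial costate, and only the second parameter is recovered uniquely.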


To conclude, an example illustrating the results of this section is presented. 


Example 4.1: 
Consider an optimal control problem, i.e. a one-player differential game, with system dynamics

\[
\dot{x}(t) = u_1(t) \tag{4.31}
\]

where $u_1(t) \in \mathbb{R}$ and with an initial state value $x_0 = 1$. Let the cost function be of the form
(4.2) with $T = 3$ and the basis functions

pi Ou, = [A x) Oa] (4.32) 
and cost function parameters

\[
\theta_1 = \theta_1^* = \begin{bmatrix} 1 & 5 & 2 \end{bmatrix}^\top. \tag{4.33}
\]

The optimal control problem is solved for the optimal state and control trajectories in Figure 4.1
by applying the minimum principle and solving the coupled differential equations
numerically. These trajectories are unique solutions to the problem since $\theta_1^*$ satisfies the
positive definite and positive semidefinite conditions of [AM89, Section 3.4]. To solve the inverse
optimal control problem, Algorithm 1 is applied. The Riccati equation leads to the submatrix

\[
\bar{P}_1 = \begin{bmatrix}
0.4614 & -0.6126 & -0.6126 \\
-0.6126 & 0.9951 & 0.9951 \\
-0.6126 & 0.9951 & 0.9951
\end{bmatrix}
\tag{4.34}
\]

which is rank deficient. Computing the SVD of $\bar{P}_1$ yields

\[
U_1 = \begin{bmatrix}
-0.4113 & -0.9115 & 0.0000 \\
0.6445 & -0.2909 & -0.7071 \\
0.6445 & -0.2909 & 0.7071
\end{bmatrix}
\tag{4.35}
\]

and therefore $U_1^{12} = \begin{bmatrix} 0 & -0.7071 \end{bmatrix}^\top \neq 0$, which implies that the
parameters $\hat{\theta}_1 \in \Theta_1$ solving the inverse optimal control problem are not unique. Thus,
the general solution is given by (4.28). By inspecting this solution, we observe that the first
parameter of $\bar{\alpha}_1$, which corresponds to $\hat{\theta}_{1,(2)}$, can uniquely be recovered (cf. the
first entry of $U_1^{12}$). Nevertheless, the free parameter $b \in \mathbb{R}$ affects the parameter
$\hat{\theta}_{1,(3)}$, leading to the non-uniqueness. Using (4.28), the general solution of
$\hat{\theta}_1$ can be formulated as

\[
\hat{\theta}_1 = \begin{bmatrix} 1 \\ 5 \\ 4.467 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ -0.7071 \end{bmatrix} b, \quad b \in \mathbb{R}. \tag{4.36}
\]

Indeed, by solving the optimal control problem again with (4.36) and any $b \in \mathbb{R}$, it is
confirmed that the optimal trajectories $x^*(t)$ and $u_1^*(t)$ are unaffected by the choice of $b$.
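
The structure of this example can be checked directly from the rounded numbers printed in (4.34) to (4.36). The short sketch below only uses those printed values; the helper names are chosen for illustration, and the check is limited to the precision of the four printed digits.

```python
# Rounded values as printed in (4.34) and (4.35)
P_BAR = [[0.4614, -0.6126, -0.6126],
         [-0.6126, 0.9951, 0.9951],
         [-0.6126, 0.9951, 0.9951]]
NULL_DIR = [0.0, -0.7071, 0.7071]        # third column of U_1

def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def theta_hat(b):
    """General solution (4.36) for the free scalar b."""
    return [1.0, 5.0, 4.467 - 0.7071 * b]

# NULL_DIR lies in the null space of P_BAR because its second and third columns coincide
residual = matvec(P_BAR, NULL_DIR)

# Value of b for which (4.36) reproduces the third entry of the parameters in (4.33)
b_true = (4.467 - 2.0) / 0.7071
```

The product of the submatrix with the third singular vector vanishes, the second parameter equals 5 for every choice of the free scalar, and one particular value of that scalar reproduces the third entry of the original parameter vector.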


[Plot: $x^*(t)$ and $u_1^*(t)$ over $t$ in s, for $t \in [0, 3]$]

Figure 4.1: State and control trajectories solving the optimal control problem of Example 4.1 


4.4 Inverse Feedback Differential Games 


The inverse differential game problem assuming a feedback information structure consists in 
finding the cost function parameters of all players such that the observed trajectories corre- 
spond to a feedback Nash equilibrium. 


As already noted in Section 3.6.2, the Nash solution of one player depends on the Nash controls
of all other players. More specifically, the differential equation of the costate variables
corresponding to player $i$ depends on the other controls since, due to the closed-loop
information structure, these depend on the state variables. In other words, the control $u_i(t)$
is determined by a feedback strategy in the form of a control law $u_i(t) = \gamma_i(x, t)$. As
discussed in Section 3.6.2, the conditions presented in Theorem 3.1 now include the new costate
equation

\[
\dot{\psi}_i(t) = -\nabla_x H_i\big(\psi_i(t), x^*(t), u_i^*(t), \gamma_{\neg i}^*(x^*, t), t\big)
\]

in order to account for the dependency of the other players’ strategies on the state variable.


4.4.1 Residual-Based Approach 


In order to apply the residual-based method, the following assumption is introduced. 


Assumption 4.3 (Control Laws) 
The Nash equilibrium control laws $u_i(t) = \gamma_i(x, t)$ are known for all players $i \in \mathcal{P}$.


Under Assumption 4.3, instead of (3.1), we have 


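$$\dot{x}(t) = f_i\big(x(t), \gamma_1(x, t), \ldots, u_i(t), \ldots, \gamma_N(x, t), t\big), \quad x(0) = x_0 \tag{4.37}$$

which represents the system dynamics from the point of view of each player $i \in \mathcal{P}$. Furthermore,
Assumption 4.3 leads to the basis functions

$$\phi_i\big(x(t), u_i(t), \gamma_{\neg i}(x, t), t\big), \quad i \in \mathcal{P}. \tag{4.38}$$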


Under Assumption 4.3 and the consequently introduced system dynamics (4.37) and basis 
functions (4.38), we obtain the Hamiltonian function of player i 


$$H_i = \theta_i^\top \phi_i(x, u_i, \gamma_{\neg i}, t) + \psi_i^\top f_i(x, u_i, \gamma_{\neg i}, t), \tag{4.39}$$


where the implicit dependencies were omitted for brevity. 


Residuals can be introduced analogously to Section 4.3 according to Definition 4.4. Thus, the 
inverse differential game with feedback strategies can be solved by applying the residual- 
based method (4.10). Using (4.37) and (4.38), we redefine the matrices 


Ni(t) = [PYxp E) Pafi] (4.40) 


VPVx$;(t) eri abe’ VPVxfi(t) (4.41) 


Q(t) z VaQ:(t) Vu, fiH Vu 9; (4) Vu, fi) 
and note that the differences with respect to the open-loop case arise from the influence of
the new system dynamics $f_i$ and basis functions on the partial derivatives. With these definitions,
we can proceed analogously to the open-loop case, ultimately yielding Lemmas 4.1
and 4.2. Consequently, results analogous to Theorem 4.1 and Corollary 4.1 can be formulated.
The formal theorem statements and proofs are omitted here.


4.4.2 Example 


The following example illustrates the application of the residual-based method for inverse 
feedback differential games. 


Example 4.2: 


Consider a two-player differential game with system dynamics
$$\dot{x}(t) = -x(t) + u_1(t) + u_2(t) \tag{4.42}$$

where $u_i(t) \in \mathbb{R}$, $i \in \mathcal{P}$, and with an initial state value $x_0 = 5$. Let the cost function be of the
form (4.2) with $T = 6$ and the basis functions


= 
pi (x(t), wi(t),£) = [O x2(t) | ‚ Lje{L2hi#j (4.43) 
and cost function parameters 


(4.44) 

5=%&=|1 2 1]. (4.45) 
These parameters are used to solve for the Nash equilibrium state and control trajectories
depicted in Figure 4.2. Since a linear-quadratic differential game is at hand, this was done
by solving the coupled Riccati equations (3.69), which also confirms the Nash character of
the trajectories according to Theorem 3.6. This inverse differential game problem is illustrated by
recovering the cost function parameters of player 1. The feedback strategies of each player
have the form $u_i^*(t) = \gamma_i(x, t) = -k_i^*(t) x(t)$, leading to the system dynamics

$$\dot{x}(t) = -x(t) + u_1(t) - k_2^*(t) x(t) \tag{4.46}$$
and the basis functions 


Qi (x(t), u(t), t) = pO X) dx)" (4.47) 
according to (4.37) and (4.38). The Riccati equation of the residual-based method leads to the
submatrix

$$P_1 = \begin{bmatrix} 16.045 & 7.499 & -7.470 \\ 7.499 & 3.505 & -3.491 \\ 7.470 & -3.492 & 3.642 \end{bmatrix} \tag{4.48}$$

which has full rank equal to $M_i + n - 1 = 3$. Therefore, with the results of Theorem 4.1 and
Corollary 4.1, we obtain the unique solution


$$\hat{\theta}_1 = [1.000 \;\; 1.000 \;\; 10.000]^\top = \theta_1^*. \tag{4.49}$$




Figure 4.2: State and control trajectories solving the differential game in Example 4.2 


The presented example illustrates Theorem 4.1 for inverse feedback differential games, allowing
the identification of cost function parameters if the control laws of all players are known,
according to Assumption 4.3. Interestingly, in Example 4.2, the cost function parameters of
player 1 could be exactly recovered, even though the basis functions were partially redundant
due to the fact that the control of player 2 depends on the state variable. However, since $k_2^*(t)$
was exactly known for all $t \in [0, T]$, its contribution to the state variable could
be distinguished by the method.


4.5 Method Limitations 


Before concluding this chapter, possible limitations of the presented methods shall be discussed.
A first issue could emerge if only truncated trajectories are available, i.e. we only
have access to the trajectories (and control laws, in the feedback case) for $t \in [0, \bar{T}]$ with
$\bar{T} < T$. The method can still be applied, but the quality of identification depends on the
extent to which the available truncated trajectories represent the complete optimal trajectories.
A further issue arises if Assumption 4.2 does not hold. This assumption may be violated e.g.
due to misspecified dynamics or basis functions, or imperfect trajectories.²⁸ Additionally, the
violation might be even more severe if the trajectories do not even represent a Nash equilibrium,
regardless of the basis functions or the system dynamics. In either case, by solving
(4.10), parameters $\theta_i$ and functions $\psi_i(t)$ result such that (4.4a) and (4.6) hold approximately,
with their priority assigned via the choice of p. Due to the fact that the approach is based on
conditions for Nash equilibria which are generally only necessary, it cannot always be guaranteed
that the resulting parameters can be used for determining Nash equilibrium trajectories.


Lastly, the exact knowledge of the feedback strategies as implied by Assumption 4.3 is a
rather strict assumption. Nevertheless, given that the state $x^*(t)$ and control trajectories $u_i^*(t)$,
$i \in \mathcal{P}$, are available, it is possible to at least determine an approximation using parameter
estimation techniques. This will be examined in the next chapter in the context of inverse
linear-quadratic differential games.


4.6 Conclusion 


In this chapter, an inverse differential game method based on necessary conditions for Nash 
equilibria was presented. The main idea consisted in the formulation of residuals which 
represent the violation of the open-loop Nash equilibrium conditions if the parameters (and 
costate functions) do not correspond to a Nash equilibrium under the observations of the state 
and control trajectories. The minimization of the residuals leads to a dynamic optimization
problem for each player i, the minimizers of which are given by the sought cost function pa- 
rameters of that specific player. The method is substantially based on the solution of a Riccati 
differential equation and a static quadratic program, thus avoiding the expensive computation 
of Nash equilibrium trajectories in each iteration and allowing for the statement of sufficient 
conditions for the unique solution of the cost function parameters in an inverse open-loop 
differential game. 


Moreover, an approach to solve inverse differential games with feedback strategies was pre- 
sented. It was shown that it is possible to formulate a residual-based method for the feedback 
case by assuming the knowledge of the control laws. In this way, the sufficient conditions 
for the solution uniqueness are still valid. Nevertheless, in general, the control’s dependence 
on the states may lead to redundant basis functions which potentially make the exact esti- 
mation of the cost function parameters difficult due to the ambiguity of the solution of the 
residual-based method. 


This chapter presented results for finite-horizon inverse differential games. The following
chapter deals with inverse problems for the class of infinite-horizon linear-quadratic differential
games and aims at gaining additional insight by exploiting the particular system and
cost function structure.

28 The latter two cases shall be examined in Chapter 7.


5 Inverse Non-Cooperative Linear-Quadratic 
Differential Games 


This chapter is devoted to the solution of inverse problems in non-cooperative linear-quadratic 
differential games. This particular class of inverse differential games arises if the dynamic 
system all players are controlling is linear and a quadratic structure of the player cost func- 
tions is given. Furthermore, the considered planning horizon is infinite, leading to constant 
linear feedback strategies of the players. Linear system dynamics and quadratic cost func- 
tions are ubiquitous in control theory and therefore, the properties of this kind of inverse 
differential games are thoroughly investigated. The techniques employed in this chapter are 
similar to the ones applied in Chapter 4 in the sense that control-theoretical conditions for 
Nash equilibria are leveraged, i.e. an inverse optimal control approach is applied. The main 
contribution presented in this chapter consists of the formulation of explicit solution sets 
describing all possible solutions of an inverse LQ differential game with an infinite horizon. 
The dimensions of this solution set depend on the characteristics of the differential game, e.g. 
number of states, controls and players. Furthermore, necessary and sufficient conditions are 
given for the uniqueness (up to a positive factor) of the inverse differential game solutions. 
Finally, on a more practical side, a quadratic program is formulated which allows the efficient 
computation of one solution (belonging to the whole solution set) and the corresponding algorithm
for implementation is presented. The chapter ends with an illustrative example of
the method and a conclusion.²⁹


5.1 Problem Definition 


Consider a continuous-time N-player noncooperative differential game of linear-quadratic 
type according to Definition 3.11. Therefore, the continuous-time state process of the game 
is described by the initial value problem 


$$\dot{x}(t) = A x(t) + \sum_{i=1}^{N} B_i u_i(t) \tag{5.1a}$$

$$x(0) = x_0 \tag{5.1b}$$


29 The results of this chapter were partially previously published in the journal paper [IBM+19].
where it is further assumed that $(A, [B_1 \; \cdots \; B_N])$ is stabilizable. Following the explanations
in Section 3.8.2, the results of this chapter shall be restricted to the consideration of
constant linear feedback strategies, i.e. strategies $\gamma_i$ belonging to the set (3.73). Therefore,
the control trajectories are given by

$$u_i(t) = -K_i x(t), \quad \forall i \in \mathcal{P}, \tag{5.2}$$

with the control laws $K = (K_1, \ldots, K_N)$ (cf. (3.77)). In particular, these lead to a stable closed-loop
system (cf. (3.67))

$$F = A - \sum_{j=1}^{N} B_j K_j, \tag{5.3}$$

i.e. they belong to the set of stabilizing control law tuples defined in (3.74).


In this chapter, a Lagrangian quadratic cost function
$$J_i(x_0, K, Q_i, R_{ij}) = \frac{1}{2} \int_0^\infty x^\top Q_i x + \sum_{j \in \mathcal{P}} u_j^\top R_{ij} u_j \, \mathrm{d}t, \tag{5.4}$$

is considered for each player $i \in \mathcal{P}$, where the same matrix assumptions as in Definition
3.11 are made, i.e. $Q_i$, $R_{ij}$ are symmetric for all $i, j \in \mathcal{P}$ and $R_{ii} \succ 0$ for all $i \in \mathcal{P}$.³⁰ By
posing (5.4), a particular structure of the cost functions of all players is defined, similar to
the basis function approach considered in Section 4.2. Indeed, a cost function of the form
(5.4) can be equivalently represented as a cost function with basis functions as introduced in
(4.2).³¹ The cost function $J_i$ in (5.4) is written as a function of the $N$-tuple of feedback laws
$K = (K_1, \ldots, K_N)$ and the initial state $x_0$ since together these generate the state and control
trajectories $x(t)$ and $u_i(t)$ via (5.1) and (5.2). Finite cost function values are guaranteed by the
restriction to strategies or feedback laws belonging to $\mathcal{F}$ as defined in (3.74).
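
The equivalence between the quadratic state cost in (5.4) and a basis-function representation can be checked numerically. The following is a minimal sketch in plain Python; the matrix `Q`, the state `x` and all helper names are hypothetical and serve illustration only.

```python
def vec(Q):
    """Column-wise vectorization of a square matrix given as nested lists."""
    n = len(Q)
    return [Q[r][c] for c in range(n) for r in range(n)]


def basis(x):
    """Basis functions 0.5 * x_l * x_p, ordered consistently with vec(Q)."""
    n = len(x)
    return [0.5 * x[r] * x[c] for c in range(n) for r in range(n)]


def quadratic_term(Q, x):
    """Direct evaluation of 0.5 * x^T Q x."""
    n = len(x)
    return 0.5 * sum(x[r] * Q[r][c] * x[c] for r in range(n) for c in range(n))


# Hypothetical data, not taken from the text
Q = [[1.0, 0.3], [0.3, 2.0]]
x = [0.5, -1.5]

theta = vec(Q)                     # parameter vector
phi = basis(x)                     # basis functions evaluated at x
linear_form = sum(t * p for t, p in zip(theta, phi))
```

Both `linear_form` and `quadratic_term(Q, x)` evaluate the same running cost, which is why the LQ cost structure fits the linear-in-parameters setting of Chapter 4.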


In this chapter, feedback Nash equilibria are considered which are defined in the context of 
infinite-horizon LQ differential games as follows (cf. Definition 3.7). 


30 Note that no definiteness assumptions on $Q_i$, $i \in \mathcal{P}$, are made since the control laws are restricted to the
stabilizing set $\mathcal{F}$ (cf. [EBS00]).
31 This follows directly from e.g. $\tfrac{1}{2} x^\top Q_i x = \theta_i^\top \phi_i$ with $\theta_i = \operatorname{vec}(Q_i)$ and where $\phi_i$ has the elements
$\tfrac{1}{2} x_l x_p$, $\forall l, p \in \{1, \ldots, n\}$.
Definition 5.1 (Feedback Nash Equilibrium ([EBS00]))

An $N$-tuple $K^* = (K_1^*, \ldots, K_N^*) \in \mathcal{F}$ is called a stationary linear feedback Nash equilibrium
if
$$J_i(x_0, K^*, Q_i, R_{ij}) \le J_i(x_0, K^*_{\neg i}(\beta), Q_i, R_{ij}), \tag{5.5}$$
holds for all $i \in \mathcal{P}$, all $x_0 \in \mathbb{R}^n$, and all $\beta$ such that $K^*_{\neg i}(\beta) \in \mathcal{F}$, where
$K^*_{\neg i}(\beta) = (K_1^*, \ldots, K_{i-1}^*, \beta, K_{i+1}^*, \ldots, K_N^*)$.


The FNE is generally not unique (cf. Section 3.8.2), i.e. various tuples K* corresponding to 
a particular infinite-horizon LQ differential game may exist. However, in the following, one 
specific FNE denoted by K* shall be considered. 


The following definition is introduced before formalizing the inverse LQ differential game 
problem. 


Definition 5.2 (Canonical Parameter Set) 


The canonical parameter set of the LQ differential game is the set $\Theta$ which contains all
possible cost function parameters of (5.4), i.e. all possible matrices $Q_i$ and $R_{ij}$, $\forall i, j \in \mathcal{P}$,
which lead to the Nash equilibrium given by $K^*$, i.e.

$$\Theta = \{\theta_i \mid i \in \mathcal{P},\; K^* = K(\theta_1, \ldots, \theta_N) \text{ fulfills (5.5)}\}, \tag{5.6}$$

where $\theta_i$ contains the elements of the matrices $Q_i$ and $R_{ij}$, $i, j \in \mathcal{P}$.


This definition follows directly from the ill-posedness of inverse differential
games. It allows for describing a general set of solutions of inverse differential games which do
not necessarily differ solely by a constant factor. Furthermore, the following assumption
is introduced.


Assumption 5.1 


The Nash equilibrium feedback matrices $K^* \in \mathcal{F}$ are known.


With this assumption, which is similar to Assumption 4.1 made in the last chapter, the inverse
infinite-horizon LQ differential game problem considered in this chapter is defined as
follows.³²

32 In the remainder of this chapter, the considered inverse problem shall be referred to as inverse linear-quadratic
differential game problem. The infinite-horizon property shall be omitted for the sake of brevity.
Definition 5.3 (Inverse Linear-Quadratic Differential Game Problem)

Consider an infinite-horizon LQ differential game consisting of system dynamics (5.1),
where $A$, $B_i$, $\forall i \in \mathcal{P}$, are given, and unknown cost functions (5.4). Furthermore, let Assumption
5.1 hold such that Nash equilibrium feedback matrices $K^*$ are available. Determine the
canonical parameter set $\Theta$ described in Definition 5.2.


While this problem definition is related to the problem in Definition 4.3, it is different in the
sense that not only one single tuple of parameter vectors $\theta = (\theta_1, \ldots, \theta_N)$ is sought, but the
complete set of (equivalent) possible tuples of parameter vectors which lead to a given Nash
equilibrium. Furthermore, instead of a Nash equilibrium described by the trajectories $x^*(t)$
and $u_i^*(t)$, $i \in \mathcal{P}$, the availability of a Nash equilibrium described by a tuple of control laws
$K^*$ is assumed.


Remark 5.1: 


By solving the problem of Definition 5.3 we can also solve the related problem of finding $\Theta$ if,
instead of $K^*$, trajectories $x^*(t)$ and $u_i^*(t)$, $i \in \mathcal{P}$, are given. This follows from the fact that $K_i^*$
can be estimated via (5.2). Indeed, such an estimation is commonly performed in single-player
inverse LQ optimal control, e.g. in [PCC+15] and [FMM+18], where the proposed methods also
rely on the availability of a control law. Further details on the estimation of $K^*$ are given in
Section 5.4.2.
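
One simple way to obtain such an estimate from sampled trajectories is ordinary least squares applied to (5.2). The sketch below is an assumption made for illustration and not necessarily the procedure of Section 5.4.2; the system, the gain `K_true`, the step size and the function name `estimate_feedback` are hypothetical.

```python
import numpy as np


def estimate_feedback(X, U):
    """Least-squares estimate of K in u(t) = -K x(t).

    X: state samples of shape (n, T), U: control samples of shape (m, T).
    """
    # Solve X^T K^T = -U^T in the least-squares sense.
    K_T, *_ = np.linalg.lstsq(X.T, -U.T, rcond=None)
    return K_T.T


# Hypothetical noise-free data: double integrator with a known stabilizing gain
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
K_true = np.array([[1.0, 1.5]])

dt, steps = 0.01, 500
x = np.array([1.0, 0.0])
X, U = [], []
for _ in range(steps):
    u = -K_true @ x
    X.append(x.copy())
    U.append(u.copy())
    x = x + dt * (A @ x + B @ u)   # forward-Euler step of (5.1)

X = np.array(X).T                  # shape (n, T)
U = np.array(U).T                  # shape (m, T)
K_hat = estimate_feedback(X, U)
```

With noise-free samples that span the state space, `K_hat` reproduces `K_true` up to numerical precision; with measured data, the estimate is only an approximation of the control law.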


5.2 Solution Sets for Inverse Linear-Quadratic 
Differential Games 


This section presents general solution sets for inverse LQ differential games such that the 
problem of Definition 5.3 is solved. Similar to Chapter 4, available results on the conditions for 
feedback Nash equilibria shall be exploited. In the case of an infinite-horizon LQ differential 
game, the conditions are available in the form of coupled algebraic Riccati equations (ARE). 


5.2.1 Coupled Algebraic Riccati Equations 


The following theorem is introduced as a basis for the development of the results of this 
chapter. 
Theorem 5.1 (Necessary and Sufficient Conditions for Feedback Nash Equilibria)

Let there exist an $N$-tuple of symmetric matrices $P_i$, $i \in \mathcal{P}$, satisfying the $N$ matrix algebraic
Riccati equations (ARE)

$$P_i F + F^\top P_i + \sum_{j \in \mathcal{P}} P_j B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^\top P_j + Q_i = 0 \tag{5.7}$$

such that $F$ is stable. Furthermore, let $K_i^*$ be defined as
$$K_i^* = R_{ii}^{-1} B_i^\top P_i. \tag{5.8}$$

Then, $K^* = (K_1^*, \ldots, K_N^*)$ is a FNE as in Definition 5.1 and $J_i(x_0, K^*, Q_i, R_{ij}) = x_0^\top P_i x_0$.
Conversely, if $K^*$ is a FNE, then the set of ARE (5.7) has a stabilizing solution.


Proof: 
See the proof of [EBS00, Theorem 4]. 
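
For given feedback matrices, the conditions of Theorem 5.1 can be checked numerically, because (5.7) is linear in $P_i$ once $F$ and the terms $K_j^\top R_{ij} K_j$ are fixed. The sketch below borrows the double-integrator data and the rounded gains reported in Example 5.2; the helper names are hypothetical, and the check is only as accurate as the four-digit gains.

```python
import numpy as np


def vec(X):
    """Column-wise vectorization."""
    return np.asarray(X, dtype=float).reshape(-1, order="F")


def solve_P(F, W):
    """Solve P F + F^T P + W = 0 for P via vectorization."""
    n = F.shape[0]
    I = np.eye(n)
    F_sum = np.kron(F.T, I) + np.kron(I, F.T)
    return np.linalg.solve(F_sum, -vec(W)).reshape(n, n, order="F")


# Data of Example 5.2 (gains rounded to four digits)
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = [np.array([[0.0], [1.0]]), np.array([[0.0], [1.0]])]
K = [np.array([[0.5773, 1.2827]]), np.array([[0.5774, 0.5882]])]
Q = [np.diag([1.0, 2.0]), np.diag([1.0, 0.7])]
R = [[np.array([[1.0]]), np.array([[0.0]])],   # R_11, R_12
     [np.array([[0.0]]), np.array([[1.0]])]]   # R_21, R_22

F = A - sum(Bj @ Kj for Bj, Kj in zip(B, K))
stable = bool(np.all(np.linalg.eigvals(F).real < 0))

gain_errors = []
for i in range(2):
    # With K_j = R_jj^{-1} B_j^T P_j, the sum in (5.7) equals sum_j K_j^T R_ij K_j.
    W = Q[i] + sum(K[j].T @ R[i][j] @ K[j] for j in range(2))
    P_i = solve_P(F, W)
    K_from_P = np.linalg.solve(R[i][i], B[i].T @ P_i)   # right-hand side of (5.8)
    gain_errors.append(float(np.max(np.abs(K_from_P - K[i]))))
```

If `stable` is true and the entries of `gain_errors` are at the level of the rounding error, the pair $(P_i, K_i^*)$ is consistent with (5.7) and (5.8).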


Remark 5.2: 


The ARE given in (5.7) are an alternative and equivalent formulation of the ARE given in (3.75).
Both expressions are common in differential game theory.

Theorem 5.1 represents a necessary and sufficient condition for feedback Nash equilibria.
Hence, if the feedback matrices $K^*$ are given, the cost function parameters in the matrices
$R_{ij}$ and $Q_i$, $i, j \in \mathcal{P}$, must fulfill (5.7). This fact shall be leveraged in order to develop a
method to solve the inverse LQ differential game. Inspired by [JAK89] and [AKFIJ12], where
numerical techniques for continuous-time Riccati equations and results on the properties of
Sylvester and Lyapunov type algebraic equations were introduced, respectively, Kronecker
products shall be applied to derive a reformulation of (5.7) which serves as a basis for the
subsequent results.


Reformulation of the Algebraic Riccati Equations 


Before presenting the reformulation, let us define a Kronecker sum [Bre78] as

$$X \oplus Y = (X \otimes I_q) + (I_p \otimes Y), \tag{5.9}$$

for square matrices $X \in \mathbb{R}^{p \times p}$ and $Y \in \mathbb{R}^{q \times q}$, where $I_q$ denotes a $q$-dimensional identity
matrix and $\otimes$ is the Kronecker product. In order to develop a reformulation of (5.7), we
require the following result.
Lemma 5.1 (Inverse Existence)

Define $F_\oplus := F^\top \oplus F^\top$ where $F$ is calculated by means of (5.3) with any tuple of feedback
matrices $K^* \in \mathcal{F}$ (cf. (3.74)). The inverse $F_\oplus^{-1}$ exists.

Proof:
$F_\oplus^{-1}$ exists if all eigenvalues $\lambda_l \in \sigma(F_\oplus)$, $l \in \{1, \ldots, n^2\}$, are different from zero. By using
[Zha11, Theorem 4.8], we discern that $\lambda_l = \mu_j + \mu_k$, where $\mu_j, \mu_k \in \sigma(F)$, for $j, k \in \{1, \ldots, n\}$
such that $l$ is associated to a particular combination of $j$ and $k$, i.e. $j = \lceil l/n \rceil$ and $k = l - n(j-1)$.
Since only stabilizing feedback matrices belonging to the set $\mathcal{F}$ in (3.74) are considered, $F$ is
a stable matrix and thus $\operatorname{Re}(\lambda_l) < 0$, $\forall l \in \{1, \ldots, n^2\}$. The lemma assertion follows. □
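
The eigenvalue argument of the proof can be illustrated numerically. The sketch below, which uses the closed-loop data reported in Example 5.2 and a hypothetical helper `kron_sum`, compares the spectrum of the Kronecker sum with all pairwise sums of the eigenvalues of $F$.

```python
import numpy as np


def kron_sum(X, Y):
    """Kronecker sum X (+) Y = X (x) I_q + I_p (x) Y for X (p x p) and Y (q x q)."""
    p, q = X.shape[0], Y.shape[0]
    return np.kron(X, np.eye(q)) + np.kron(np.eye(p), Y)


# Closed-loop matrix (5.3) with the data of Example 5.2
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B1 = B2 = np.array([[0.0], [1.0]])
K1 = np.array([[0.5773, 1.2827]])
K2 = np.array([[0.5774, 0.5882]])
F = A - B1 @ K1 - B2 @ K2

mu = np.linalg.eigvals(F)                       # spectrum of F
lam = np.linalg.eigvals(kron_sum(F.T, F.T))     # spectrum of the Kronecker sum
pair_sums = np.array([a + b for a in mu for b in mu])

# Every pairwise sum mu_j + mu_k should appear among the eigenvalues lam.
max_mismatch = max(float(np.min(np.abs(lam - s))) for s in pair_sums)
all_negative = bool(np.all(lam.real < 0))
```

Since the eigenvalues of $F$ form a complex-conjugate pair here, the comparison is made on the real parts, which is what the invertibility argument requires.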


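Unless otherwise stated, the following calculations are with respect to a particular player
$i \in \mathcal{P}$. With the results of Lemma 5.1, the matrices

$$Z_i := (I_n \otimes B_i^\top) F_\oplus^{-1} \in \mathbb{R}^{n m_i \times n^2} \tag{5.10}$$

and
$$K_i^\otimes := K_i^\top \otimes K_i^\top \in \mathbb{R}^{n^2 \times m_i^2} \tag{5.11}$$

are defined. Furthermore, $K_i^*$ is written as $K_i$ in (5.11) and in the following lemma for
brevity.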


Lemma 5.2 (Equivalent Formulation of the ARE) 


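Let the parameter $\theta_i \in \mathbb{R}^L$ denote the vectorized matrices of the cost function (5.4), i.e.
$$\theta_i = \begin{bmatrix} \operatorname{vec}(Q_i)^\top & \operatorname{vec}(R_{i1})^\top & \cdots & \operatorname{vec}(R_{ii})^\top & \cdots & \operatorname{vec}(R_{iN})^\top \end{bmatrix}^\top, \tag{5.12}$$

where $\operatorname{vec}(X)$ represents a column vectorization of a matrix $X$, leading to $L = n^2 + \sum_{i=1}^{N} m_i^2$.
Then, the matrices $Q_i$, $R_{ij}$, $i, j \in \mathcal{P}$, corresponding to $\theta_i$ satisfy (5.7) if (and only if) $\theta_i$
fulfills
$$\boldsymbol{M}_i \theta_i = 0 \tag{5.13}$$
where $\boldsymbol{M}_i \in \mathbb{R}^{n m_i \times L}$ is given by

$$\boldsymbol{M}_i := \begin{bmatrix} Z_i & Z_i K_1^\otimes & \cdots & Z_i K_{i-1}^\otimes & (Z_i K_i^\otimes + K_i^\top \otimes I_{m_i}) & Z_i K_{i+1}^\otimes & \cdots & Z_i K_N^\otimes \end{bmatrix}. \tag{5.14}$$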




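Proof:
We rewrite (5.7) as

$$0 = \operatorname{vec}(P_i F) + \operatorname{vec}(F^\top P_i) + \sum_{j \in \mathcal{P}} \operatorname{vec}(P_j B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^\top P_j) + \operatorname{vec}(Q_i)$$

$$0 = \left[(F^\top \otimes I_n) + (I_n \otimes F^\top)\right] \operatorname{vec}(P_i) + \sum_{j \in \mathcal{P}} (K_j^\top \otimes K_j^\top) \operatorname{vec}(R_{ij}) + \operatorname{vec}(Q_i)$$
and thus
$$\operatorname{vec}(P_i) = -F_\oplus^{-1} \operatorname{vec}(Q_i) - \sum_{j \in \mathcal{P}} F_\oplus^{-1} K_j^\otimes \operatorname{vec}(R_{ij}). \tag{5.15}$$

The first equality follows from vectorizing (5.7), while for the second equality (5.8) was used
and the following equivalence was applied:

$$\operatorname{vec}(XYZ) = (Z^\top \otimes X) \operatorname{vec}(Y). \tag{5.16}$$
This equivalence holds for any matrices $X$, $Y$ and $Z$ with suitable dimensions [Bre78]. The
third equality (5.15) follows with the results of Lemma 5.1 and the definitions given in (5.11)
and (5.9). Now we rewrite (5.8) as

$$(I_n \otimes B_i^\top) \operatorname{vec}(P_i) = (K_i^\top \otimes I_{m_i}) \operatorname{vec}(R_{ii}) \tag{5.17}$$
using (5.16). Inserting (5.15) in (5.17) results in

$$Z_i \operatorname{vec}(Q_i) + (K_i^\top \otimes I_{m_i}) \operatorname{vec}(R_{ii}) + \sum_{j \in \mathcal{P}} Z_i K_j^\otimes \operatorname{vec}(R_{ij}) = 0 \tag{5.18}$$

and thus (5.13) follows immediately with (5.14) and (5.12).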
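
The vectorization identity (5.16), on which the proof rests, presumes a column-wise `vec` operator. The following sketch checks it with random matrices; all matrices and dimensions are hypothetical, and the second check covers the special case $\operatorname{vec}(R K) = (K^\top \otimes I_m) \operatorname{vec}(R)$ that underlies the rewriting of (5.8).

```python
import numpy as np


def vec(X):
    """Column-wise vectorization (Fortran order)."""
    return np.asarray(X, dtype=float).reshape(-1, order="F")


rng = np.random.default_rng(1)
n, m = 3, 2

# General identity vec(XYZ) = (Z^T (x) X) vec(Y)
X = rng.standard_normal((n, n))
Y = rng.standard_normal((n, m))
Z = rng.standard_normal((m, n))
lhs = vec(X @ Y @ Z)
rhs = np.kron(Z.T, X) @ vec(Y)

# Special case used for (5.8): vec(R K) = (K^T (x) I_m) vec(R)
R = rng.standard_normal((m, m))
K = rng.standard_normal((m, n))
lhs_gain = vec(R @ K)
rhs_gain = np.kron(K.T, np.eye(m)) @ vec(R)
```

Note that NumPy reshapes row-wise by default, so `order="F"` is needed for the identity to hold in this form.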


The parameters $\theta_i$ for which (5.13) holds are valid solutions of (5.7) for a given $K^*$. Note that
the feedback matrices $K^* = (K_1^*, \ldots, K_N^*)$ completely characterize the Nash equilibrium trajectories
$x^*(t)$ and $u_i^*(t)$, $i \in \mathcal{P}$. This follows from (5.1) fulfilling all conditions for admitting
a unique solution for any $N$-tuple of continuous controls (5.2) [BO99]. Thus, the parameters
$\theta_i$ are associated to a Nash equilibrium represented by either the feedback matrices or the
state and control trajectories.


5.2.2 Canonical Parameter Set 


The matrix Riccati equations (5.7) have multiple solutions which potentially represent different
Nash equilibria [Wee01]. However, it is worth emphasizing that we are only interested
in all parameter tuples $\theta$ which represent a specific Nash equilibrium. Bearing this in mind,
the following theorem gives the main result.


Theorem 5.2 (Canonical Parameter Set of Inverse LQ Differential Games) 


Let a LQ differential game be given by (5.1) and (5.4). Furthermore, let Assumption 5.1 hold
such that Nash equilibrium control laws $K^*$ are given. Then, the canonical parameter set of
the corresponding inverse LQ differential game is given by

$$\Theta = \bigcup_{i \in \mathcal{P}} \ker(\boldsymbol{M}_i), \tag{5.19}$$

with convex boundaries such that $R_{ii} \succ 0$, $\forall i \in \mathcal{P}$.

Proof:
By inspecting (5.13) from Lemma 5.2 we can recognize that all parameters which satisfy
the ARE lie within the kernel of $\boldsymbol{M}_i$, which depends on $K^*$. Therefore, all possible cost
function parameters of player $i$ which lead to the known Nash equilibrium are given by
$\operatorname{span}(v_i^{(1)}, \ldots, v_i^{(d_i)})$, where $d_i$ represents the dimension of the kernel of $\boldsymbol{M}_i$ with basis vectors
$v_i^{(1)}, \ldots, v_i^{(d_i)}$. The set including the cost function parameters of all players corresponding to the
Nash equilibrium represented by $K^*$ is thus given by (5.19). □


Note that the results of Lemma 5.2 together with Theorem 5.2 allow for a simple proof of
the well-known invariance of the Nash equilibrium in case any cost function parameter vector $\theta_i$ is
multiplied by a positive constant.

Corollary 5.1
The trajectories constituting a Nash equilibrium under $N$ cost functions $J_i(\theta_i)$, $i \in \mathcal{P}$, of an
infinite-horizon LQ differential game will constitute the same Nash equilibrium for $J_i(\tilde{\theta}_i)$
with $\tilde{\theta}_i = c_i \theta_i$, $\forall c_i > 0$.

Proof:
This can easily be seen from $\boldsymbol{M}_i c_i \theta_i = c_i \boldsymbol{M}_i \theta_i = 0$, which affects neither $\ker(\boldsymbol{M}_i)$ nor
$\Theta$. □


The results of Lemma 5.2 as well as Theorem 5.2 are derived with respect to the parameter definition
in (5.12) which considers the most general case where no assumptions on the structure
of the cost function matrices were made, e.g. symmetry. The characteristics of the differential
game and, in particular, the properties of the cost function matrices affect the dimensions of
$\ker(\boldsymbol{M}_i)$ and consequently of the canonical parameter set $\Theta$. Therefore, in the next section,
some properties of inverse LQ differential games based on the possible structures of the cost
function matrices are discussed.


5.3 Properties of Inverse Linear-Quadratic Differential 
Game Solution Sets 


Cost function matrices in a quadratic cost function are largely assumed to be at least sym- 
metric. Furthermore, in many applications, these are assumed to be diagonal. Since these 
matrix properties reduce the number of unknown parameters, inverse LQ differential games 
and their solution sets shall be analyzed considering all possible cases for the cost function 
matrices. 


5.3.1 Preliminaries 


Let us define the variable $M_i \in \mathbb{R}^+$ to denote the number of (non-redundant) parameters
of a player's cost function. The specific value of $M_i$ depends on whether the cost function
matrices are symmetric or diagonal. We have

$$M_i = \begin{cases} \dfrac{n^2 + n}{2} + \sum_{j \in \mathcal{P}} \dfrac{m_j^2 + m_j}{2}, & \text{symmetric matrices} \\ n + \sum_{j \in \mathcal{P}} m_j, & \text{diagonal matrices} \\ L, & \text{else.} \end{cases} \tag{5.20}$$

Since $M_i \le L$ holds, the analysis of inverse LQ differential games is based on the vectors
$\bar{\theta}_i \in \mathbb{R}^{M_i}$ which have a potentially reduced dimension compared to the parameter vector of
Lemma 5.2. The matrix $\bar{\boldsymbol{M}}_i \in \mathbb{R}^{n m_i \times M_i}$ is introduced accordingly as a possible modification
of the matrix $\boldsymbol{M}_i$.


Remark 5.3: 


The vector $\bar{\theta}_i \in \mathbb{R}^{M_i}$ and the modified matrix $\bar{\boldsymbol{M}}_i \in \mathbb{R}^{n m_i \times M_i}$ comply with Lemma 5.2 in the
sense that
$$\bar{\boldsymbol{M}}_i \bar{\theta}_i = 0 \tag{5.21}$$

holds. Consequently, the results of Theorem 5.2 and, obviously, Corollary 5.1 hold for these
introduced variables as well.


In the following, an example illustrating the introduced modifications is presented. 
Example 5.1:

Consider a 2-player LQ differential game with $n = 2$, $m_1 = m_2 = 1$, where the cost functions
are given by (5.4). By Lemma 5.2, we obtain $M_i = L = 6$, leading to the vector

$$\theta_i = \begin{bmatrix} Q_{i,(1,1)} & Q_{i,(2,1)} & Q_{i,(1,2)} & Q_{i,(2,2)} & R_{i1} & R_{i2} \end{bmatrix}^\top, \tag{5.22}$$

where $Q_{i,(r,c)}$ with $r, c \in \{1, 2\}$ denotes the element of $Q_i$ in the $r$-th row and $c$-th column.
Furthermore, we have the matrix

$$\boldsymbol{M}_i = \begin{bmatrix} (m_i)_1 & (m_i)_2 & \cdots & (m_i)_6 \end{bmatrix}, \quad i \in \{1, 2\}, \tag{5.23}$$

where $(m_i)_j$, $j \in \{1, 2, \ldots, L\}$, denotes the $j$-th column of $\boldsymbol{M}_i$.

Diagonal Matrices
In case of diagonal matrices, $Q_{i,(2,1)} = Q_{i,(1,2)} = 0$, $i \in \{1, 2\}$. Therefore, the reduced non-redundant
parameter vector has the dimension $M_i = 4$, $i \in \{1, 2\}$, and is given by

$$\bar{\theta}_i = \begin{bmatrix} Q_{i,(1,1)} & Q_{i,(2,2)} & R_{i1} & R_{i2} \end{bmatrix}^\top. \tag{5.24}$$

Thus, we set
$$\bar{\boldsymbol{M}}_i = \begin{bmatrix} (m_i)_1 & (m_i)_4 & (m_i)_5 & (m_i)_6 \end{bmatrix}, \quad i \in \{1, 2\}, \tag{5.25}$$

such that (5.21) is fulfilled.

Symmetric Matrices
In case of symmetric matrices, $Q_{i,(2,1)} = Q_{i,(1,2)}$, $i \in \{1, 2\}$. This leads to a reduced non-redundant
parameter vector with the dimension $M_i = 5$, $i \in \{1, 2\}$, and given by

$$\bar{\theta}_i = \begin{bmatrix} Q_{i,(1,1)} & Q_{i,(1,2)} & Q_{i,(2,2)} & R_{i1} & R_{i2} \end{bmatrix}^\top. \tag{5.26}$$
Hence, we set
$$\bar{\boldsymbol{M}}_i = \begin{bmatrix} (m_i)_1 & (m_i)_2 + (m_i)_3 & (m_i)_4 & (m_i)_5 & (m_i)_6 \end{bmatrix}, \quad i \in \{1, 2\}, \tag{5.27}$$
such that (5.21) is fulfilled.

These modifications allow for the analysis of inverse LQ differential games and their solution
sets in the case of symmetric or diagonal cost function matrices by means of the kernel of
$\bar{\boldsymbol{M}}_i$.
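
The column operations of Example 5.1 translate directly into array indexing. The sketch below uses a random stand-in for the matrix of (5.23), since only the column structure matters here; the numerical values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((2, 6))   # hypothetical stand-in with n*m_i = 2 rows and L = 6 columns

# Diagonal case (5.25): keep the columns of Q_(1,1), Q_(2,2), R_i1 and R_i2
M_diag = M[:, [0, 3, 4, 5]]

# Symmetric case (5.27): merge the columns of Q_(2,1) and Q_(1,2)
M_sym = np.column_stack([M[:, 0], M[:, 1] + M[:, 2], M[:, 3], M[:, 4], M[:, 5]])

# Hypothetical parameter values
q11, q12, q22, r1, r2 = 1.0, 0.4, 2.0, 1.0, 0.5

theta_full_sym = np.array([q11, q12, q12, q22, r1, r2])   # ordering of (5.22)
theta_red_sym = np.array([q11, q12, q22, r1, r2])         # ordering of (5.26)

theta_full_diag = np.array([q11, 0.0, 0.0, q22, r1, r2])
theta_red_diag = np.array([q11, q22, r1, r2])             # ordering of (5.24)

residual_gap_sym = M @ theta_full_sym - M_sym @ theta_red_sym
residual_gap_diag = M @ theta_full_diag - M_diag @ theta_red_diag
```

Both gaps vanish, i.e. the reduced matrix applied to the reduced vector gives the same product as the full matrix applied to the structured full vector, which is the sense in which (5.21) complies with Lemma 5.2.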
5.3.2 Sufficient Condition for Solution Sets


In the following, the set of all possible parameters $\bar{\theta}_i$ which lead to the same Nash equilibrium,
provided all other parameters $\bar{\theta}_{\neg i}$ are fixed, is denoted as the solution set of player $i \in \mathcal{P}$. This
solution set is defined by the non-trivial solutions of (5.21). Therefore, one way to characterize
these solutions is using the kernel of $\bar{\boldsymbol{M}}_i$. Its dimension will depend on the number of
linearly independent equations generated by the $n m_i$ rows of $\bar{\boldsymbol{M}}_i$ compared to the number
of unknown parameters $M_i$. Since $\operatorname{rank}(\bar{\boldsymbol{M}}_i) \le \min(M_i, n m_i)$, the number of players, states
and controls of each player as well as the assumed properties of the cost function matrices
are important for evaluating the existence of inverse differential game solutions.


Proposition 5.1: 


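The solution set of player $i$ is at least one-dimensional if the number of rows of $\bar{\boldsymbol{M}}_i$ is strictly
less than the number of parameters in $\bar{\theta}_i$, i.e. $\ker(\bar{\boldsymbol{M}}_i) \neq \{0\}$ if $n m_i < M_i$.


Proof:
The condition $n m_i < M_i$ implies $\operatorname{rank}(\bar{\boldsymbol{M}}_i) < M_i$, leading to $\dim(\ker(\bar{\boldsymbol{M}}_i)) > 0$.


Proposition 5.1 gives a sufficient condition for the existence of vectors spanning the kernel
of $\bar{\boldsymbol{M}}_i$. The exact dimension of the kernel is determined by $\operatorname{rank}(\bar{\boldsymbol{M}}_i)$, namely $\dim(\ker(\bar{\boldsymbol{M}}_i)) = M_i - \operatorname{rank}(\bar{\boldsymbol{M}}_i)$. The following example
illustrates the results of Theorem 5.2 and the solution set concept.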


Example 5.2: 


Consider an infinite-horizon LQ differential game where two players control a double-integrator
system given by

$$\dot{x}(t) = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} x(t) + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_1(t) + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_2(t). \tag{5.28}$$

The cost functions of the two players are given by (5.4) with $Q_1 = \operatorname{diag}(1, 2)$ and $Q_2 =
\operatorname{diag}(1, 0.7)$ as well as $R_{11} = 1$, $R_{12} = R_{21} = 0$ and $R_{22} = 1$. The parameter vector of player $i$
is given by

$$\bar{\theta}_i = \begin{bmatrix} Q_{i,(1,1)} & Q_{i,(2,2)} & R_{ii} \end{bmatrix}^\top, \quad i \in \{1, 2\}.$$

The game is solved by calculating the solution of the finite-horizon version of the game, i.e.
solving the corresponding RDEs (3.69), and extracting the converged value of $P_i$ afterwards.
The resulting $K^*$ represents a Nash equilibrium since the calculated $P_i$ satisfies (5.7) for all
players and the closed-loop stability of the system dynamics was confirmed (cf. Theorem 5.1).
The calculated Nash equilibrium is $(K_1^*, K_2^*) = ([0.5773 \;\; 1.2827], [0.5774 \;\; 0.5882])$.
The kernels of the matrices $\bar{\boldsymbol{M}}_i \in \mathbb{R}^{2 \times 3}$ are defined by the span of the vectors

$$v_1^{(1)} = [v_{1,(j)}^{(1)}]_{j=1,2,3} = [0.4083 \;\; 0.8165 \;\; 0.4083]^\top \tag{5.29}$$
$$v_2^{(1)} = [v_{2,(j)}^{(1)}]_{j=1,2,3} = [0.6337 \;\; 0.4437 \;\; 0.6337]^\top \tag{5.30}$$

which result in the canonical parameter set
$$\Theta = \{\mu_i Q_i, \mu_i R_{ii} \mid i \in \mathcal{P},\; \mu_i \in \mathbb{R}^+\}, \tag{5.31}$$
which consists of the solution sets of players 1 and 2 and where $Q_i = \operatorname{diag}(v_{i,(1)}^{(1)}, v_{i,(2)}^{(1)})$ and $R_{ii} =
v_{i,(3)}^{(1)}$. This means that the cost function parameters are unique up to a constant factor. In
particular, $\mu_1 = 2.4494$ and $\mu_2 = 1.5779$ lead to the defined ground truth parameters.
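
The kernel vectors of this example can be reproduced, up to the rounding of the reported gains, by assembling the matrix of (5.14) with Kronecker products and taking the right-singular vector that belongs to the kernel. The following is a sketch under these assumptions; `build_M` and `kernel_vector` are hypothetical helper names, and players are indexed from zero.

```python
import numpy as np


def build_M(A, B, K, i):
    """Matrix M_i of (5.14) for player i (0-based index)."""
    n = A.shape[0]
    I_n = np.eye(n)
    F = A - sum(Bj @ Kj for Bj, Kj in zip(B, K))           # closed loop (5.3)
    F_sum = np.kron(F.T, I_n) + np.kron(I_n, F.T)          # Kronecker sum F^T (+) F^T
    Z = np.kron(I_n, B[i].T) @ np.linalg.inv(F_sum)        # (5.10)
    blocks = [Z]
    for j, Kj in enumerate(K):
        block = Z @ np.kron(Kj.T, Kj.T)                    # Z_i K_j^(x), cf. (5.11)
        if j == i:
            block = block + np.kron(Kj.T, np.eye(Kj.shape[0]))
        blocks.append(block)
    return np.hstack(blocks)


def kernel_vector(M):
    """Unit vector spanning a one-dimensional kernel, first entry positive."""
    _, _, Vt = np.linalg.svd(M)
    v = Vt[-1]
    return v if v[0] > 0 else -v


A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = [np.array([[0.0], [1.0]]), np.array([[0.0], [1.0]])]
K = [np.array([[0.5773, 1.2827]]), np.array([[0.5774, 0.5882]])]

kernels, scalings = [], []
for i in range(2):
    M = build_M(A, B, K, i)
    # Columns 0..3 belong to vec(Q_i), column 4 to R_i1, column 5 to R_i2.
    # Keep Q_(1,1), Q_(2,2) and R_ii, as in the parameter vector of the example.
    M_red = M[:, [0, 3, 4 + i]]
    v = kernel_vector(M_red)
    kernels.append(v)
    scalings.append(1.0 / v[2])    # factor that restores R_ii = 1
```

Multiplying `kernels[i]` by `scalings[i]` recovers the ground truth parameters approximately; the small deviations stem from the four-digit rounding of $K^*$.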


As mentioned in the introduction of this section, the number of unknown parameters depends
on the properties of the matrices, which in turn have an influence on the possible dimensions
of each player's solution set for the inverse LQ differential game. This aspect is further examined
in the following.


General Cost Function Matrices 


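In the case of arbitrary cost function matrices, $M_i = L = n^2 + \sum_{j \in \mathcal{P}} m_j^2$ holds. Since $n m_i \le
0.5(n^2 + m_i^2) < n^2 + \sum_{j \in \mathcal{P}} m_j^2$ for any choice of $n$, $m_j$, $\forall j \in \mathcal{P}$, and $N \in \mathbb{N}^+$, $\dim(\ker(\bar{\boldsymbol{M}}_i)) > 0$
follows. The sufficient condition of Proposition 5.1 is fulfilled.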


Symmetric Cost Function Matrices 


If we assume symmetry of all cost function matrices, then $M_i = 0.5\big(n^2 + n + \sum_{j \in \mathcal{P}} (m_j^2 + m_j)\big)$.
Since

$$n m_i \le 0.5(n^2 + m_i^2) < 0.5\Big(n(n+1) + \sum_{j \in \mathcal{P}} m_j(m_j + 1)\Big) = M_i$$
for any choice of $n$, $m_j$, $\forall j \in \mathcal{P}$, and $N \in \mathbb{N}^+$, $\dim(\ker(\bar{\boldsymbol{M}}_i)) > 0$ holds. The sufficient
condition of Proposition 5.1 is fulfilled and the solution set of player $i$ can be given in terms
of the vectors $v_i$ which span the kernel of $\bar{\boldsymbol{M}}_i$.


Diagonal Cost Function Matrices 


Only in the case of diagonal matrices, where $M_i = n + \sum_{j \in \mathcal{P}} m_j$, combinations of $n$, $m_i$,
$N$ exist such that $n m_i \ge M_i$, thus potentially leading to an empty solution set. Here we
note that if $\operatorname{rank}(\bar{\boldsymbol{M}}_i) = M_i - 1$, then the solution set of player $i$ is one-dimensional and a
unique algebraic solution for player $i$'s parameters may be found by setting $\bar{\theta}_{i,(j)} = 1$ for one
particular $j \in \{1, \ldots, M_i\}$ and proceeding analogously to [MZ18, Proposition 1], where the
special case $N = 1$ is considered. This is possible e.g. if $n = 1$ and $m_i = 1$ (besides $N = 1$).

[Figure 5.1, panels (a) Symmetrical cost function matrices and (b) Diagonal cost function matrices: number of equations $n m_i$ and number of parameters $M_i$ plotted over $n$ and $m_i$.]

Figure 5.1: Number of parameters and equations in the ILQDG problem depending on the number of states and
controls in a one-player LQ differential game. The red thick line/dot denotes the cases where $n m_i =
M_i - 1$.


The analysis of the sufficient condition for symmetric and diagonal cost function matrices is
illustrated in Figure 5.1 for the case $N = 1$. The number of equations (rows of $\bar{\boldsymbol{M}}_i$) and the
number of parameters $M_i$ are shown as a function of the number of states $n$ and the number
of controls $m_i$. In Figure 5.1(a), which depicts the case of symmetrical cost function matrices,
the number of parameters $M_i$ is always greater than the number of equations $n m_i$ such that
the solution set of player 1 is at least one-dimensional. In Figure 5.1(b), which depicts the case
of diagonal cost function matrices, we observe that there are combinations of $n$ and $m_i$ which
lead to $n m_i \ge M_i$, thus not yet allowing for any conclusion concerning the solution set. In
turn, the situations where the kernel of $\bar{\boldsymbol{M}}_i$ is guaranteed to be non-trivial in this scenario are
represented by the red thick line. It denotes the cases where $n m_i = M_i - 1 < M_i$, which fulfill
the sufficient condition of Proposition 5.1.
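
The comparison between the number of equations and the number of parameters can be tabulated with a few lines of code. The sketch below implements the parameter count (5.20) and the sufficient condition of Proposition 5.1; the function names and the `structure` labels are hypothetical.

```python
def num_parameters(n, m, structure):
    """Number of non-redundant parameters M_i according to (5.20).

    n: number of states, m: list with the control dimensions of all players.
    """
    if structure == "symmetric":
        return (n * n + n) // 2 + sum((mj * mj + mj) // 2 for mj in m)
    if structure == "diagonal":
        return n + sum(m)
    return n * n + sum(mj * mj for mj in m)   # general case, equals L


def kernel_guaranteed(n, m, i, structure):
    """Sufficient condition of Proposition 5.1 for player i (0-based): n*m_i < M_i."""
    return n * m[i] < num_parameters(n, m, structure)


# Setting of Example 5.1: n = 2 and m_1 = m_2 = 1
counts = {s: num_parameters(2, [1, 1], s) for s in ("general", "symmetric", "diagonal")}

# One-player cases with diagonal matrices, as in Figure 5.1(b)
diagonal_cases = {(n, m): kernel_guaranteed(n, [m], 0, "diagonal")
                  for n in range(1, 5) for m in range(1, 5)}
```

For a single player with diagonal matrices, the condition holds exactly when $n = 1$ or $m_i = 1$, since $n m_i < n + m_i$ is equivalent to $(n-1)(m_i-1) < 1$.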


These 3D maps are altered if $N > 1$ and in the general case where each player penalizes the
other players' controls, i.e. $R_{ij} \neq 0$ for $i \neq j$, $i, j \in \mathcal{P}$. The cases $N = 2$ and $N = 3$ are shown
in Appendix C for further illustration of how the properties of $\bar{\boldsymbol{M}}_i$ are affected by the
number of players, states and controls.


Remark 5.4: 


The previous analysis shows the implications of $n m_i < M_i$ as a sufficient condition for the
existence of a solution set for player $i$ which is at least one-dimensional. The case
$n m_i \geq M_i$ demands further attention, given that it potentially leads to a trivial kernel
of $\mathbf{M}_i$ (containing only the zero vector); this occurs if
$\operatorname{rank}(\mathbf{M}_i) = M_i$. Nevertheless, this does not imply that a solution of
the inverse differential game problem for player $i$ does not exist. Indeed, the existence of a
Nash equilibrium described by $\mathbf{K}^*$ implies the existence of at least one $N$-tuple
$\boldsymbol{\theta} = \boldsymbol{\theta}^*$ which generated the equilibrium.


In light of Remark 5.4, the next section presents a formulation of inverse LQ differential
games which allows a solution of the inverse differential game problem to be found regardless
of the presented properties. In addition, it facilitates the derivation of further general
results concerning the solution sets of each player.


5.4 Quadratic Programming Formulation for Inverse 
Linear-Quadratic Differential Games 


The approach is based on the formulation of a residual function, analogously to Definition
4.4, which denotes the extent to which the necessary and sufficient conditions for Nash
equilibria are violated. Since the conditions are represented by the coupled ARE (5.7) and its
reformulation (5.21), where the matrix $\mathbf{M}_i$ depends on the given matrices $\mathbf{A}$,
$\mathbf{B}_i$, $i \in \mathcal{P}$, and $\mathbf{K}^* = (\mathbf{K}_1^*, \dots, \mathbf{K}_N^*)$,
the following residual is introduced.


Definition 5.4 (Residual) 
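Let a function $\mathbf{r}_i : \mathbb{R}^{M_i} \to \mathbb{R}^{n m_i}$, $i \in \mathcal{P}$, be defined as


$$ \mathbf{r}_i(\boldsymbol{\theta}_i) = \mathbf{M}_i \boldsymbol{\theta}_i. \tag{5.32} $$


The function $\mathbf{r}_i$ is called residual of the coupled ARE (5.7).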


The violation of the coupled ARE in terms of the residual function occurs if the parameters
$\boldsymbol{\theta}_i$ do not represent a Nash equilibrium for given feedback control laws
$\mathbf{K}_i$ and system dynamics matrices $\mathbf{A}$ and $\mathbf{B}_i$, $i \in \mathcal{P}$.
While it would be possible to pose an optimization problem such that $\|\mathbf{r}_i\|$ is
minimized, it is computationally more convenient to consider a quadratic residual function.
The following lemma relates the quadratic residual function to the AREs.


Lemma 5.3 


Let an LQ differential game be given by (5.1) and (5.4). Furthermore, let Assumption 5.1 hold.
The ARE (5.7) is fulfilled if and only if $\|\mathbf{M}_i \boldsymbol{\theta}_i\|^2 = 0$.

Proof:


The proof is trivial given that the norm of a vector is zero if and only if the vector itself
is the zero vector.


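In light of Lemma 5.3, the optimization problem


$$
\begin{aligned}
\min_{\boldsymbol{\theta}_i} \; \|\mathbf{r}_i\|^2 \;=\; \min_{\boldsymbol{\theta}_i} \;\; & \tfrac{1}{2}\, \boldsymbol{\theta}_i^\top \mathbf{H}_i \boldsymbol{\theta}_i \\
\text{s.t.} \quad & \theta_{i(j)} > 0, \quad \forall j \in \{1, \dots, M_i\}, \\
& \mathbf{R}_{ii} \succ 0
\end{aligned}
\tag{5.33}
$$


is posed, where $\mathbf{H}_i = 2(\mathbf{M}_i^\top \mathbf{M}_i) \in \mathbb{R}^{M_i \times M_i}$.
Analogously to the residual-based approach in Section 4.3.1, the aim of the optimization
problem (5.33) is to minimize the quadratic residual to obtain parameters
$\boldsymbol{\theta}_i$ which fulfill the ARE.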
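
The relation between the residual (5.32) and the quadratic objective of (5.33) can be checked numerically. The sketch below uses a small, purely hypothetical matrix `M_demo` (two equations, three parameters) and verifies that $\tfrac{1}{2}\boldsymbol{\theta}_i^\top \mathbf{H}_i \boldsymbol{\theta}_i$ with $\mathbf{H}_i = 2\mathbf{M}_i^\top \mathbf{M}_i$ equals the squared residual norm; the constraints of (5.33) are not handled here.

```python
def residual(M, theta):
    """r(theta) = M @ theta, written out for plain Python lists."""
    return [sum(a * t for a, t in zip(row, theta)) for row in M]


def hessian(M):
    """H = 2 * M^T M."""
    cols = len(M[0])
    return [[2.0 * sum(row[a] * row[b] for row in M) for b in range(cols)]
            for a in range(cols)]


def objective(H, theta):
    """0.5 * theta^T H theta."""
    size = len(theta)
    return 0.5 * sum(theta[a] * H[a][b] * theta[b]
                     for a in range(size) for b in range(size))


# Hypothetical data: 2 equations, 3 parameters, so the kernel is nontrivial.
M_demo = [[1.0, -1.0, 0.0],
          [0.0, 2.0, -2.0]]
H_demo = hessian(M_demo)

theta_kernel = [1.0, 1.0, 1.0]   # lies in ker(M_demo): zero residual
theta_other = [1.0, 2.0, 0.0]    # violates both equations

for theta in (theta_kernel, theta_other):
    squared_norm = sum(r * r for r in residual(M_demo, theta))
    print(theta, squared_norm, objective(H_demo, theta))
```

Any positive multiple of `theta_kernel` also yields a zero objective, which is the scaling ambiguity discussed in the remainder of this section.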


Remark 5.5: 


The constraints $\theta_{i(j)} > 0$, $\forall j \in \{1, \dots, M_i\}$, in (5.33) are introduced
in order to avoid trivial solutions. Literature on inverse optimal control and inverse games
often introduces the constraint $\theta_{i(j)} = 1$ for any $j \in \{1, \dots, M_i\}$ (see note
25 on page 51). Analogous results concerning the properties of (5.33) can easily be proved with
this (additional) constraint. Also note that, in case of diagonal cost function matrices,
$\theta_{i(j)} > 0$, $\forall j \in \{1, \dots, M_i\}$, ensures $\mathbf{R}_{ii} \succ 0$.


5.4.1 Necessary and Sufficient Conditions for One-Dimensional 
Solution Sets 


In the following, the quadratic program (5.33) is leveraged to obtain insights on inverse LQ
differential games. The properties of the quadratic program (5.33) differ considerably
depending on whether $\operatorname{rank}(\mathbf{M}_i)$ is less than or equal to the number of
parameters $M_i$ (the rank cannot exceed $M_i$, the number of columns of $\mathbf{M}_i$). By
considering the case $n m_i < M_i$, which leads to $\operatorname{rank}(\mathbf{M}_i) < M_i$, the
following proposition can be stated.


Proposition 5.2: 


Let an LQ differential game be given by (5.1) and (5.4) such that $n m_i < M_i$. Then, the
quadratic program (5.33) is convex and a solution is guaranteed to exist.

Proof:

It is clear that both constraint sets, the one defined by $\theta_{i(j)} > 0$,
$\forall j \in \{1, \dots, M_i\}$, and the one defined by $\mathbf{R}_{ii} \succ 0$, are convex
and therefore their intersection is also convex. Under the condition $n m_i < M_i$ we obtain
$\operatorname{rank}(\mathbf{H}_i) = \operatorname{rank}(\mathbf{M}_i^\top \mathbf{M}_i) \leq \min(n m_i, M_i) = n m_i < M_i$,
leading to a convex (since $\mathbf{M}_i^\top \mathbf{M}_i \succeq 0$) but not strictly convex
objective function. Hence, the quadratic program is convex and therefore always has a
solution. □


The results of Proposition 5.2 are not surprising for the case where Assumption 5.1 holds,
since this guarantees that at least one solution for the parameters $\boldsymbol{\theta}_i$ of a
particular player $i \in \mathcal{P}$ (and the ones generated by a multiplying positive
constant) must exist. Note that solving the optimization problem (5.33) leads to one of the
solutions belonging to $\ker(\mathbf{M}_i)$ (cf. Proposition 5.1 and Theorem 5.2), but it does
not give any information on the dimensions of each player's solution set.


The following theorem is stated as the main result regarding the canonical parameter set of 
inverse LQ differential games. 


Theorem 5.3 (Necessary and Sufficient Conditions for Uniqueness up to a Pos- 
itive Factor) 


Let an LQ differential game be given by (5.1) and (5.4). Furthermore, let Assumption 5.1 hold.
The inverse LQ differential game has a canonical parameter set of the form


$$ \Theta = \{ c_i \boldsymbol{\theta}_i \,;\; c_i > 0,\; i \in \mathcal{P} \}, \tag{5.34} $$


if and only if $n m_i \geq M_i - 1$ and additionally $\operatorname{rank}(\mathbf{M}_i) = M_i - 1$.


Proof: 

We first state that $n m_i \geq M_i - 1$ is a necessary condition for unique solutions since
$n m_i < M_i - 1$ leads to a solution set of a dimension greater than 1 (cf. Proposition 5.1).
By the results of Lemma 5.3, (5.7) is fulfilled if and only if
$\|\mathbf{M}_i \boldsymbol{\theta}_i\|^2 = 0$. We therefore proceed to analyze the quadratic
program (5.33). Under the theorem condition $\operatorname{rank}(\mathbf{M}_i) = M_i - 1$ we have
$\dim(\ker(\mathbf{M}_i)) = 1$, which implies a one-dimensional solution set for each player
$i \in \mathcal{P}$ of the form (5.34).


The case $\operatorname{rank}(\mathbf{M}_i) < M_i - 1$ leads to solution sets with a dimension
greater than 1 and is therefore excluded. Hence, only the case
$\operatorname{rank}(\mathbf{M}_i) = M_i$ remains, which we analyze using (5.33). If
$\operatorname{rank}(\mathbf{M}_i) = M_i$, which is only possible if $n m_i \geq M_i$, then we
obtain $\mathbf{H}_i \succ 0$ and thus (5.33) is strictly convex. Strict convexity leads to a
unique solution of (5.33) and therefore to a unique solution of the ARE (5.7). But the latter
contradicts Corollary 5.1, from which we conclude that
$\operatorname{rank}(\mathbf{M}_i) = M_i - 1$ is also necessary. □

Theorem 5.3 gives necessary and sufficient conditions for the solution set of each player $i$
to be one-dimensional, i.e. each player's parameters $\boldsymbol{\theta}_i$ are unique up to a
real positive factor $c_i$.


Summarizing the results of this subsection, if the canonical parameter set has the form (5.34),
then a particular $\boldsymbol{\theta}_i$ belonging to the corresponding solution set of each
player $i$ can be computed by means of the quadratic program (5.33). If the conditions of
Theorem 5.3 are not fulfilled, then by the results of Proposition 5.2, (5.33) yields some
solution from the canonical parameter set, with non-unique parameters for each player
$i \in \mathcal{P}$.


5.4.2 Identification of Feedback Matrices 


The optimization problem (5.33) always yields a solution which is associated with a given
Nash equilibrium represented by $\mathbf{K}^*$. If only observed Nash equilibrium control and
state trajectories are available, then it becomes necessary to estimate the control laws
$\mathbf{K}^*$. For the $N$-player inverse differential game at hand, a least-squares
identification based on (5.2) is proposed. For this purpose, let us introduce a finite sequence
of sampling times


$$ \mathcal{T}_i := \{ t_k \in [0, T] : 1 \leq k \leq K_i \,\wedge\, 0 \leq t_1 < \dots < t_{K_i} \leq T \} \tag{5.35} $$


for each player $i \in \mathcal{P}$, where $[0, T]$ is the time interval for which
$\mathbf{x}^*(t)$ and $\mathbf{u}_i^*(t)$ are available.


Let the values of the state and control trajectories at $t_k$ be denoted by $\mathbf{x}^{[k]}$
and $\mathbf{u}_i^{[k]}$, respectively. Then, the feedback matrix can be estimated by means of


$$ \hat{\mathbf{K}}_i = \arg\min_{\mathbf{K}_i} \sum_{k=1}^{K_i} \left\| \mathbf{K}_i \mathbf{x}^{[k]} + \mathbf{u}_i^{[k]} \right\|_2^2, \tag{5.36} $$

where $\| \cdot \|_2$ denotes the Euclidean norm. Least-squares estimation theory states that the
parameters (in this case the entries of $\mathbf{K}_i$) can be recovered if persistence of
excitation (PE) conditions are fulfilled [AW95, Section 2.4]. These conditions demand that the
trajectories of $\mathbf{x}$ and $\mathbf{u}_i$ are "informative" enough and are e.g. not
identical to zero. Furthermore, if the least-squares estimation is considered from a stochastic
point of view, i.e.


$$ \mathbf{u}_i^{[k]} = -\mathbf{K}_i \mathbf{x}^{[k]} + \boldsymbol{\varepsilon}_i, \tag{5.37} $$

where $\boldsymbol{\varepsilon}_i \in \mathbb{R}^{m_i}$ denotes a vector of zero-mean Gaussian
white noise, then the estimation is bias-free if $\boldsymbol{\varepsilon}_i(t)$ is independent
of the state $\mathbf{x}(t)$ [AW95, p. 47]. The conditions for a bias-free estimation are
usually not fulfilled. For example, the state $\mathbf{x}(t)$ depends on the controls
$\mathbf{u}_i(t)$ due to the system dynamics and is therefore not independent of the additive
Gaussian noise. Nevertheless, the LS estimation works well in practice, as shown later in
Chapter 7.
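
The least-squares estimate (5.36) can be illustrated with a short, dependency-free sketch. It solves the normal equations row by row for noise-free samples; the gain `K_true`, the state signal and the helper names are hypothetical choices for illustration and are not taken from the thesis. In practice a numerical linear algebra library would typically be used instead of the hand-written elimination.

```python
import math


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # Singular normal equations: the samples are not informative enough.
            raise ValueError("samples do not excite all state directions")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def estimate_feedback(xs, us):
    """Minimize sum_k ||K x[k] + u[k]||^2 over K, given sampled states and controls."""
    n, m = len(xs[0]), len(us[0])
    # Normal equations per row j of K: (sum_k x x^T) k_j = -(sum_k x u_j).
    gram = [[sum(x[a] * x[b] for x in xs) for b in range(n)] for a in range(n)]
    K = []
    for j in range(m):
        rhs = [-sum(x[a] * u[j] for x, u in zip(xs, us)) for a in range(n)]
        K.append(solve_linear(gram, rhs))
    return K


# Hypothetical data: K_i = 501 samples on [0, 10] and a made-up feedback gain.
K_true = [[0.5, -1.2]]
ts = [0.02 * k for k in range(501)]
xs = [[math.cos(t), math.sin(2.0 * t)] for t in ts]
us = [[-sum(kj * xj for kj, xj in zip(row, x)) for row in K_true] for x in xs]

K_hat = estimate_feedback(xs, us)
print(K_hat)
```

If all state samples are zero, the normal equations become singular and the sketch raises an error, which mirrors the requirement that the trajectories be informative.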
5.4.3 Algorithm and Example


The inverse LQ differential game method for determining a particular solution parameter
vector $\boldsymbol{\theta}_i$ of player $i$ based on (5.33) can be implemented with the
following algorithm.


Algorithm 2 IOC-based method for player $i$ in an inverse feedback LQ differential game.


Input: State and control trajectories $\mathbf{x}(t)$ and $(\mathbf{u}_1(t), \dots, \mathbf{u}_N(t))$,
system matrix $\mathbf{A}$, input matrices $\mathbf{B}_i$, $\forall i \in \mathcal{P}$.
Output: Computed player $i$ cost function parameters $\boldsymbol{\theta}_i$.
1: Estimate $\hat{\mathbf{K}}_i$ for all $i \in \mathcal{P}$ with (5.36) and determine the
corresponding closed-loop system matrix $\mathbf{F}$ with (3.67).
2: Compute the matrices given by (5.10) and (5.11).
3: Compute matrix $\mathbf{M}_i$ with (5.14).
4: Solve the quadratic optimization problem (5.33).


Note that, similar to the methods presented in Chapter 4, Algorithm 2 may be used for cost 
function parameter identification of any player i € P in an N-player infinite-horizon LQ 
differential game. Furthermore, the method may also be applied for the special case of a 
single player, i.e. an inverse LQ optimal control problem. The core of the presented approach 
is the quadratic program which can be solved very efficiently with state-of-the-art methods, 
e.g. active-set and interior point methods [NW06, Chapter 16]. 


This section closes with an example that illustrates Theorem 5.3 and the use of Algorithm 2
to identify cost function parameters in an inverse LQ differential game.


Example 5.3: 


Consider an infinite-horizon LQ differential game where 2 players control a stabilizable linear 
system defined by the differential equation 


x(t) = i zl x(t) + 


r | esl” | u(t) (5.38) 


0 


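and select their feedback strategies according to a cost function of the form (5.4) with cost
function matrices


$$ \mathbf{Q}_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{Q}_2 = \begin{bmatrix} 1 & 0 \\ 0 & 10 \end{bmatrix}, \quad \mathbf{R}_{11} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{R}_{22} = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}. \tag{5.39} $$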



The vectorization of the cost function matrices according to (5.12) leads to a parameter vector
of dimension $M_i = 4$ given by

$$ \boldsymbol{\theta}_i = \begin{bmatrix} Q_{i(1,1)} & Q_{i(2,2)} & R_{ii(1,1)} & R_{ii(2,2)} \end{bmatrix}^\top, \quad i \in \{1, 2\}, $$


where $Q_{i(j,j)}$ and $R_{ii(j,j)}$ denote the $j$-th diagonal entry of $\mathbf{Q}_i$ and
$\mathbf{R}_{ii}$, respectively. Analogously to the last example, the infinite-horizon LQ
differential game was solved by calculating the solution of the corresponding RDEs (3.69) and
extracting the converged value of $\mathbf{P}_i$. The resulting state and control trajectories
$\mathbf{x}^*(t)$ and $\mathbf{u}_i^*(t)$ were confirmed to correspond to a stable system and
hence, to a Nash equilibrium.


In this example, the inverse method is given the resulting state and control trajectories
$\mathbf{x}^*(t)$ and $\mathbf{u}_i^*(t)$ instead of the Nash equilibrium feedback matrices
$\mathbf{K}^*$. Following Algorithm 2, these trajectories were used to estimate the feedback
matrices with the LS approach given in (5.36), where a set $\mathcal{T}_i$ with $T = 10$ and
$K_i = 501$ was selected according to (5.35). The Nash equilibrium can be estimated accurately,
with deviations $\|\mathbf{K}_i^* - \hat{\mathbf{K}}_i\| < 10^{-4}$ for all $i \in \{1, 2\}$.
With $\mathbf{K}^* = (\mathbf{K}_1^*, \mathbf{K}_2^*)$, we obtain the matrices


$$ \mathbf{M}_1 = \begin{bmatrix} -0.436 & -0.026 & 0.466 & 0.004 \\ 0.100 & -0.027 & 0.053 & -0.126 \\ 0.100 & -0.027 & -0.078 & 0.006 \\ -0.032 & -0.153 & -0.020 & 0.204 \end{bmatrix} \tag{5.40} $$


and

$$ \mathbf{M}_2 = \begin{bmatrix} -0.436 & -0.026 & 0.530 & -0.365 \\ 0.100 & -0.027 & 0.144 & -0.114 \\ 0.100 & -0.027 & 0.264 & -0.353 \\ -0.032 & -0.153 & -0.048 & 1.655 \end{bmatrix}. \tag{5.41} $$

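We find that $\operatorname{rank}(\mathbf{M}_i) = M_i - 1$ holds for $i \in \{1, 2\}$, which
indicates a one-dimensional solution set for each player $i$ according to Theorem 5.3. By
solving the quadratic program (5.33) we obtain the parameters


$$
\begin{aligned}
\hat{\boldsymbol{\theta}}_1 &= \begin{bmatrix} 1.000 & 1.000 & 1.000 & 1.000 \end{bmatrix}^\top, \\
\hat{\boldsymbol{\theta}}_2 &= \begin{bmatrix} 0.602 & 6.024 & 1.204 & 0.602 \end{bmatrix}^\top.
\end{aligned}
\tag{5.42}
$$


The parameters $\boldsymbol{\theta}_1$ were exactly identified, while for the second player
the parameters are equal up to a multiplying constant. In particular, we have
$\hat{\boldsymbol{\theta}}_2 = 0.6024\, \boldsymbol{\theta}_2^*$.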
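
The reported result for the second player can be checked directly against the kernel condition $\mathbf{M}_2 \boldsymbol{\theta}_2 = \mathbf{0}$. The sketch below copies the rounded entries of (5.41) and (5.42); because only three decimals are available, the residual is small but not exactly zero. The helper name and the comparison vector are illustrative assumptions.

```python
import math

# Rounded entries of M_2 as printed in (5.41).
M2 = [
    [-0.436, -0.026, 0.530, -0.365],
    [0.100, -0.027, 0.144, -0.114],
    [0.100, -0.027, 0.264, -0.353],
    [-0.032, -0.153, -0.048, 1.655],
]
# Identified parameters of player 2 as printed in (5.42).
theta2_hat = [0.602, 6.024, 1.204, 0.602]


def residual_norm(M, theta):
    """Euclidean norm of the residual M @ theta."""
    return math.sqrt(sum(sum(a * t for a, t in zip(row, theta)) ** 2 for row in M))


# Undo the scaling factor 0.6024 stated in the text.
theta2_rescaled = [t / 0.6024 for t in theta2_hat]

print(residual_norm(M2, theta2_hat))            # close to zero (rounding only)
print(residual_norm(M2, theta2_rescaled))       # scaled vector stays in the kernel
print(residual_norm(M2, [1.0, 1.0, 1.0, 1.0]))  # arbitrary vector for comparison
print(theta2_rescaled)
```

The rescaled vector is the same kernel direction as the identified one, which is what uniqueness up to a positive factor means in Theorem 5.3.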
5.5 Method Limitations


Prior to this chapter's conclusion, possible limitations of the method are discussed. The first
issue arises, as in the previous chapter, if only noise-corrupted measurements of the state
and control trajectories are available. Nevertheless, since the method relies on the feedback
control laws and these are estimated by the LS method, it can be conjectured that the method 
has a considerable robustness to noise in the trajectories. This case shall be further examined 
in Section 7.5. In addition, truncated trajectories do not represent a problem as long as these 
fulfill the PE condition mentioned in Section 5.4.2. Informative trajectories can potentially 
fulfill this condition even with a small number of values. 


A further issue arises if an $i \in \mathcal{P}$ exists such that $\mathbf{K}_i$ does not
constitute a Nash equilibrium feedback law with respect to any set of cost function matrices
$\mathbf{Q}_i$, $\mathbf{R}_{ii}$, and $\mathbf{R}_{ij}$ of the assumed structure, e.g.
symmetric. More generally, $\mathbf{K}_i$ might not be a Nash equilibrium for any set of cost
function matrices, regardless of their structure. This can occur e.g. if $\mathbf{K}_i$ is
identified from trajectories $\mathbf{x}(t)$ and $\mathbf{u}_i(t)$ which do not represent a Nash
equilibrium. However, by the results of Proposition 5.2, the existence of a solution to the
quadratic program (5.33) is guaranteed, independently of the Nash character of the control
laws. Since the presented quadratic programming approach is based on the coupled AREs, which
are necessary and sufficient conditions for feedback Nash equilibria, the identification
results yield parameters which lead to the Nash equilibrium feedback law which is the closest
to the original observed feedback law. The distance is measured in terms of the violation of
the coupled AREs (cf. the discussion of the experimental results in Section 8.8). However, this
distance may not be proportional to or correlate with the error between observed and
identified trajectories.


5.6 Conclusion 


In this chapter, the inverse problem of infinite-horizon LQ differential games was consid- 
ered, where a feedback Nash equilibrium is given and cost function parameters are sought 
which explain this resulting equilibrium. The parameters correspond to the elements of the 
matrices of the quadratic cost functions of the players and the Nash equilibrium is assumed 
to be given in the form of an N-tuple of player feedback matrices. The solution of the in- 
verse LQ differential game was given in the form of an explicit set—the canonical parameter 
set—which describes all possible cost function parameter vectors or matrices which lead to 
the same Nash equilibrium, and was achieved by a reformulation of the necessary and suffi- 
cient conditions for Nash equilibria. Importantly, sufficient conditions for the possibility of 
stating such explicit sets were given. In addition, these results were applied to formulate a 
quadratic program which allows an efficient computation of the cost function parameters. 
Moreover, the analysis of the resulting quadratic program allowed for the statement of
necessary and sufficient conditions for the uniqueness of the solution set of a particular
player up to a multiplying positive constant. Finally, it was demonstrated that the feedback
matrices of all players can be estimated from Nash equilibrium state and control trajectories
by using a least-squares approach. Consequently, all of the results developed in this chapter
can be applied if, instead of the player feedback matrices, observations of Nash equilibrium
state and control trajectories are available.


The results of this chapter represent solutions related to one of the questions Kalman stated: 
"What optimization problems lead to a constant, linear control law?" (Problem A in [Kal64]). 
This problem was recently considered in [MZ18] for single-player infinite-horizon problems; 
these results have been generalized for N-player differential games in this chapter. 


6 Inverse Dynamic Games Based on Inverse Reinforcement Learning


This chapter presents inverse dynamic game solutions such that cost functions which explain 
observed behavior of several players can be found. The methods presented in this chapter 
are based on inverse reinforcement learning techniques and on a discrete-time formulation 
of the infinite dynamic game. Therefore, the methods in this chapter represent an alterna- 
tive approach to the IOC-based methods of the previous two chapters. Nevertheless, there 
is a similarity to the results of these aforementioned chapters, namely the development of 
an inverse dynamic game method which does not rely on a repeated solution of the forward 
problem, i.e. the repeated computation of Nash equilibrium state and control trajectories. 
After a short introduction to the principle of maximum entropy, which represents the basis 
of the methods, the main contribution of this chapter is shown, namely the derivation of a 
probabilistic method for inverse dynamic games based on Maximum Entropy Inverse Rein- 
forcement Learning (MaxEnt IRL). The cases where the players’ behavior corresponds to an 
open-loop and a feedback Nash equilibrium are considered. In addition, results on the unbi- 
asedness of the estimation of cost function parameters are presented. After providing further 
details which are important for the practical implementation of these methods, examples for 
the solution of inverse linear-quadratic dynamic games are given. The chapter ends with 
conclusions on all presented results.³³


6.1 Introduction to the Probabilistic Approach and 
Maximum Entropy 


In this thesis, the aim is the development of IRL methods for inverse dynamic games which 
allow for continuous-valued control and state spaces, such that comparable methods to the 
ones presented in Chapters 4 and 5 based on inverse optimal control can be obtained. The 
inverse dynamic game problem is regarded in this chapter from a probabilistic perspective 
which is introduced in the following. 


33 Preliminary versions of the results of this chapter have been published in the conference paper [KIR*17]. The 
chapter’s contents are based on the article [IBKH20]. 
Figure 6.1: Example of a probability function for trajectories


For a simplified presentation, consider the results of a dynamic game as a single trajectory
$\tilde{\zeta}(t)$ which is assumed to stem from a probability function $P(\zeta)$ defined over a
finite and discrete set of (in this case five) possible trajectories $\zeta(t)$. This scenario
is illustrated in Figure 6.1, where the observed trajectory $\tilde{\zeta}(t)$ is colored green.
In this example, a probability value is assigned to each of the five possible trajectories.
Transferring this line of thought to an inverse problem in dynamic games leads to the fact
that one or several trajectories $\tilde{\zeta}$ are observed, but their probabilities are
unknown. The choice of a probability function which explains these observed trajectories is
not unique, even if some constraints are introduced. This problem becomes even more complex if
the trajectories originate from a probability density function $p(\zeta)$ instead of the
previously presented probability mass function $P(\zeta)$, since this implies a potentially
infinite number of possible trajectories. In order to resolve the ambiguity in this kind of
problem, the principle of maximum entropy can be applied. This was introduced by Jaynes in
[Jay57] as a means to infer probability distributions which are consistent with experimental
data.³⁴ According to Jaynes, this method leads to the "least biased estimate possible on the
given information". This is illustrated e.g. by the fact that the distribution which maximizes
the entropy with the constraints of fixed and known expectation and variance is the Gaussian
distribution. Similarly, the maximum entropy distribution where no constraints are introduced
is the uniform distribution [CT06, Section 12.2].
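
For the discrete five-trajectory setting described above, the statement about the uniform distribution can be checked numerically. The following sketch compares the Shannon entropy of the uniform probability mass function with that of an arbitrarily chosen, hypothetical non-uniform one.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a probability mass function."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


# Five possible trajectories; without constraints, the uniform PMF has maximum entropy.
uniform = [0.2] * 5
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]  # hypothetical alternative assignment

print(entropy(uniform))  # equals ln(5)
print(entropy(skewed))   # strictly smaller
```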


34 Jaynes' objective was to present a potential application of information theory results
(obtained by Shannon [Sha48]) to the field of statistical mechanics. The interested reader is
also referred to [PGLD13] for a historical review.

The probabilistic perspective of dynamic games introduced above constitutes the basis of the
definition of the problem. Likewise, the principle of maximum entropy shall be leveraged for
the development of inverse dynamic game solutions presented in the next sections.


6.2 Problem Definition 


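Consider an infinite dynamic game in discrete time³⁵, where $N$ players simultaneously control
a system with (potentially time-variant) dynamics of the form (see also Definition A.1)


$$ \mathbf{x}^{(k+1)} = \mathbf{f}^{(k)}\left(\mathbf{x}^{(k)}, \mathbf{u}_1^{(k)}, \dots, \mathbf{u}_N^{(k)}\right), \tag{6.1a} $$

$$ \mathbf{x}^{(1)} = \mathbf{x}_1. \tag{6.1b} $$


The goal of each player $i \in \mathcal{P}$ is to minimize its individual cost function by
applying a control strategy. The cost functions' structure is assumed to be defined by a linear
combination of $M_i \in \mathbb{N}$ known features³⁶, i.e.


$$ J_i = -\sum_{k=1}^{k_E} \boldsymbol{\theta}_i^\top \boldsymbol{\phi}_i\left(\mathbf{x}^{(k)}, \mathbf{u}_1^{(k)}, \dots, \mathbf{u}_N^{(k)}\right), \tag{6.2} $$


where $k_E \in \mathbb{N}_{>0}$, $\boldsymbol{\phi}_i \in \mathbb{R}^{M_i}$ contains all features
of player $i$ defined analogously to Definition 4.2 and
$\boldsymbol{\theta}_i \in \mathbb{R}^{M_i}$ represents the vector of player $i$'s individual
feature weights, i.e. the cost function parameters.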


A main element of inverse problems in optimal control and dynamic games are the observed 
state and control trajectories. Generally speaking, a trajectory consists of a sequence of values 
according to the discrete-time formulation of the game. Therefore, the following definition 
is introduced. 


35 The discrete-time formulation is chosen following the line of a vast number of previous
studies on single-player IRL (cf. Section 2.1.3). The results of this chapter are based on
definitions analogous to the ones in Chapter 3. These discrete-time dynamic game definitions
are given in Appendix A.

36 In this chapter, the term features is used instead of basis functions in order to be
consistent with IRL literature. Furthermore, in the following it is assumed that the feature
functions in $\boldsymbol{\phi}_i$ are independent of $k$. Their corresponding values are still
stage-dependent through the values of the states and the controls. In addition, note that the
cost function has been multiplied with a factor of -1. This is done in order to be congruent
with IRL literature, which assumes a reward function to be maximized instead of a cost function
to be minimized.
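
Definition 6.1 (Stacked State and Control Values)
Let


$$ \mathbf{x} = \begin{bmatrix} \left(\mathbf{x}^{(1)}\right)^\top & \dots & \left(\mathbf{x}^{(k_E)}\right)^\top \end{bmatrix}^\top \in \mathbb{R}^{n k_E}, \tag{6.3} $$

$$ \mathbf{u}_i = \begin{bmatrix} \left(\mathbf{u}_i^{(1)}\right)^\top & \dots & \left(\mathbf{u}_i^{(k_E)}\right)^\top \end{bmatrix}^\top \in \mathbb{R}^{m_i k_E}, \tag{6.4} $$


$\forall i \in \mathcal{P}$, be vectors containing all values of the system state
$\mathbf{x}^{(k)}$ and the control values $\mathbf{u}_i^{(k)}$ of player $i \in \mathcal{P}$ for
all time steps $k \in \mathcal{K}$, respectively.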


Furthermore, the following notation is introduced for a set of trajectories in accordance with 
the system dynamics (6.1) which will facilitate a more compact representation of the results 
of this chapter. 


Definition 6.2 (Trajectory Set) 


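A trajectory $\zeta := \{\mathbf{x}, \mathbf{u}_1, \dots, \mathbf{u}_N\}$ is defined as a set
containing the stacked values of the system state $\mathbf{x}$ and the stacked control values
$\mathbf{u}_i$ of all players $i \in \mathcal{P}$, which is feasible with respect to the system
dynamics given by (6.1).


The estimation of the cost function parameters $\boldsymbol{\theta}_i$ is based on an observed
set of trajectories denoted by
$\tilde{\zeta} := \{\tilde{\mathbf{x}}, \tilde{\mathbf{u}}_1, \dots, \tilde{\mathbf{u}}_N\}$
which, following the probabilistic approach presented in Section 6.1, is assumed to be sampled
from a probability density function
$p(\zeta \,|\, \boldsymbol{\theta}_1^*, \dots, \boldsymbol{\theta}_N^*)$ with unknown parameters
$\boldsymbol{\theta}_1^*, \dots, \boldsymbol{\theta}_N^*$.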


A further key value in IRL problems is the feature count (cf. [AN04, RBZ06, ZMBD08] in the 
single-player case) which is introduced in the following. 


Definition 6.3 (Feature Count) 


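The feature count $\boldsymbol{\mu}_i(\zeta) \in \mathbb{R}^{M_i}$ of a player
$i \in \mathcal{P}$ along a trajectory $\zeta$ is defined as a vector containing the accumulated
values of the features along that trajectory, i.e.


$$ \boldsymbol{\mu}_i(\zeta) = \sum_{k=1}^{k_E} \boldsymbol{\phi}_i\left(\mathbf{x}^{(k)}, \mathbf{u}_1^{(k)}, \dots, \mathbf{u}_N^{(k)}\right) \tag{6.5} $$


with $\mathbf{x}^{(k)}, \mathbf{u}_i^{(k)} \in \zeta$, $\forall i \in \mathcal{P}$,
$k \in \mathcal{K}$.


Using the feature counts $\boldsymbol{\mu}_i(\zeta)$ and (6.2), the costs along a trajectory
$\zeta$ for any player $i \in \mathcal{P}$ can be rewritten as


$$ J_i(\zeta, \boldsymbol{\theta}_i) = -\boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i(\zeta). \tag{6.6} $$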
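
Definition 6.3 and the cost expression (6.6) can be illustrated with a small sketch. The quadratic feature vector, the three-stage two-player trajectory and the weights below are hypothetical and only serve to show that accumulating the features first and weighting them afterwards gives the same value as summing the weighted stage terms of (6.2).

```python
def features(x, u1, u2):
    """Hypothetical quadratic features of player 1: [x1^2, x2^2, u1^2, u2^2]."""
    return [x[0] ** 2, x[1] ** 2, u1 ** 2, u2 ** 2]


def feature_count(xs, u1s, u2s):
    """Accumulate the feature values over all stages, cf. (6.5)."""
    mu = [0.0, 0.0, 0.0, 0.0]
    for x, u1, u2 in zip(xs, u1s, u2s):
        for j, value in enumerate(features(x, u1, u2)):
            mu[j] += value
    return mu


def cost(theta, mu):
    """J_i = -theta_i^T mu_i, cf. (6.6)."""
    return -sum(t * m for t, m in zip(theta, mu))


# Hypothetical trajectory with k_E = 3 stages, two states and one control per player.
xs = [[1.0, 0.0], [0.5, -0.5], [0.0, 0.0]]
u1s = [-1.0, 0.5, 0.0]
u2s = [0.5, 0.0, 0.0]
theta_1 = [-1.0, -1.0, -1.0, -1.0]  # negative weights because of the sign convention in (6.2)

mu_1 = feature_count(xs, u1s, u2s)
stagewise = sum(cost(theta_1, features(x, u1, u2)) for x, u1, u2 in zip(xs, u1s, u2s))
print(mu_1, cost(theta_1, mu_1), stagewise)
```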

In the following and with some abuse of notation in favor of better readability,
$p(\zeta \,|\, \boldsymbol{\theta}_{1:N})$ represents the probability density of a trajectory
$\zeta$ as a function of parameters $\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_N$
corresponding to the cost functions $J_i$, $\forall i \in \mathcal{P}$.


Having introduced these basic definitions, the inverse dynamic game problem considered in 
this chapter is defined as follows. 


Definition 6.4 (Inverse Dynamic Game Based on IRL) 

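Find parameters $\hat{\boldsymbol{\theta}}_i$, $\forall i \in \mathcal{P}$, such that the
expected costs of a trajectory sampled from the probability density
$p(\zeta \,|\, \hat{\boldsymbol{\theta}}_{1:N})$ resulting from the identified parameters
correspond for each player $i \in \mathcal{P}$ to the expected costs of the observed trajectory
sampled from the probability density $p(\zeta \,|\, \boldsymbol{\theta}_{1:N}^*)$, i.e.


$$ \mathrm{E}_{p(\zeta | \hat{\boldsymbol{\theta}}_{1:N})}\left\{ J_i(\zeta, \boldsymbol{\theta}_i^*) \right\} = \mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N}^*)}\left\{ J_i(\zeta, \boldsymbol{\theta}_i^*) \right\}, \quad \forall i \in \mathcal{P}. \tag{6.7} $$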


The requirement (6.7) arises from the demand of obtaining for each player a cost function that
results in an individual performance as good as the observed one, where the performance is
measured with respect to each player's unknown true cost function
$J_i(\zeta, \boldsymbol{\theta}_i^*)$.³⁷ Similar to the inverse differential game problem of
Definition 4.3, Definition 6.4 implies that we are interested in finding one parameter vector
$\boldsymbol{\theta}_i$ for each player $i \in \mathcal{P}$ such that (6.7) holds, i.e. the
dynamic game with identified cost function parameters is able to explain the observed
trajectories. This differs from the problem investigated in Section 5.2, where the complete
solution set for each player $i \in \mathcal{P}$ is sought, since inverse problems in optimal
control and dynamic games are naturally ill-posed.


6.3 Maximum Entropy Distribution of Trajectories in 
Dynamic Games 


The principle of maximum entropy provides a means to resolve the ill-posedness issue such
that parameters can be found which solve the problem given in Definition 6.4. In this section,
we transfer the maximum entropy approach to inverse dynamic games with $N$ players. The aim is
to find a probability density function $p(\zeta \,|\, \boldsymbol{\theta}_{1:N})$ which
represents the probability of trajectories $\zeta$ as a function of the parameters
$\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_N$, yet considering (6.7) as the only
constraint or a-priori knowledge. Finding an expression for
$p(\zeta \,|\, \boldsymbol{\theta}_{1:N})$ shall provide a useful result on our way towards the
solution of inverse dynamic games with IRL.


37 Similar objectives have been frequently defined in single-player IRL methods, see e.g. the
seminal papers [NR00] and [AN04].

In order to state a relationship between observed trajectories $\tilde{\zeta}$ and the
probability distribution $p(\zeta \,|\, \boldsymbol{\theta}_{1:N}^*)$ which generated them, the
following assumption is made:


Assumption 6.1 


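The feature count of player $i$ along the trajectory $\tilde{\zeta}$ (denoted as
$\tilde{\boldsymbol{\mu}}_i$ for all players $i \in \mathcal{P}$) represents the expectation of
the feature count
$\mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N}^*)}\{\boldsymbol{\mu}_i(\zeta)\}$ based on the
probability density function $p(\zeta \,|\, \boldsymbol{\theta}_{1:N}^*)$ which results from the
parameters $\boldsymbol{\theta}_1^*, \dots, \boldsymbol{\theta}_N^*$, i.e.


$$ \mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N}^*)}\left\{ \boldsymbol{\mu}_i(\zeta) \right\} = \tilde{\boldsymbol{\mu}}_i \quad \forall i \in \mathcal{P}. \tag{6.8} $$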


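Assumption 6.1 means that each observation $\tilde{\zeta}$ is representative³⁸. As no further
information is available, the sample mean is used as an estimate for the expectation of the
feature count. Furthermore, note that Assumption 6.1 implies that if $n_T \in \mathbb{N}_{>0}$
observed trajectories are given, i.e. a set of trajectories
$\mathcal{D} = \{\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_T}\}$, the expectation of the feature
count of player $i$ is given by


$$ \mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N}^*)}\left\{ \boldsymbol{\mu}_i(\zeta) \right\} = \frac{1}{n_T} \sum_{l=1}^{n_T} \boldsymbol{\mu}_i(\tilde{\zeta}_l), \tag{6.9} $$


where $\boldsymbol{\mu}_i(\tilde{\zeta}_l)$ denotes the feature count of the observed
trajectory with $l \in \{1, \dots, n_T\}$.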


Lemma 6.1 (Path Feature Count Equivalence to Costs) 

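Let the expectation of the feature count be equal for both the probability density
$p(\zeta \,|\, \hat{\boldsymbol{\theta}}_{1:N})$ resulting from the identified parameters and the
probability function $p(\zeta \,|\, \boldsymbol{\theta}_{1:N}^*)$ with original parameters
$\boldsymbol{\theta}_1^*, \dots, \boldsymbol{\theta}_N^*$, i.e.


$$ \mathrm{E}_{p(\zeta | \hat{\boldsymbol{\theta}}_{1:N})}\left\{ \boldsymbol{\mu}_i(\zeta) \right\} = \mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N}^*)}\left\{ \boldsymbol{\mu}_i(\zeta) \right\} \tag{6.10} $$


for each player $i \in \mathcal{P}$. Then, for any parameters where
$\|\boldsymbol{\theta}_i^*\|_2 < \infty$, (6.7) is fulfilled.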


38 A representative sample is a typical sample of a population [Mar91]. The latter means in
this context all possible trajectories which can be generated from the assumed probability
density function $p(\zeta \,|\, \boldsymbol{\theta}_{1:N}^*)$.




Proof: 
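By rewriting (6.7), we can state the following relations:


$$
\begin{aligned}
0 &\leq \left| \mathrm{E}_{p(\zeta | \hat{\boldsymbol{\theta}}_{1:N})}\left\{ J_i(\zeta, \boldsymbol{\theta}_i^*) \right\} - \mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N}^*)}\left\{ J_i(\zeta, \boldsymbol{\theta}_i^*) \right\} \right| && (6.11) \\
&= \left| \mathrm{E}_{p(\zeta | \hat{\boldsymbol{\theta}}_{1:N})}\left\{ \boldsymbol{\theta}_i^{*\top} \boldsymbol{\mu}_i(\zeta) \right\} - \mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N}^*)}\left\{ \boldsymbol{\theta}_i^{*\top} \boldsymbol{\mu}_i(\zeta) \right\} \right| && (6.12) \\
&\leq \left\| \boldsymbol{\theta}_i^* \right\|_2 \left\| \mathrm{E}_{p(\zeta | \hat{\boldsymbol{\theta}}_{1:N})}\left\{ \boldsymbol{\mu}_i(\zeta) \right\} - \mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N}^*)}\left\{ \boldsymbol{\mu}_i(\zeta) \right\} \right\|_2 && (6.13)
\end{aligned}
$$


Therefore, if (6.10) holds, then the right side of (6.13) is equal to zero and hence, together
with the inequality in (6.11), this implies that (6.7) holds as well. □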


Lemma 6.1 represents the principle of matching feature expectations for all players. This 
principle was introduced in [AN04] for N = 1 and used as a basis for numerous single-player 
IRL methods. 
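
The bound (6.11) to (6.13) can be made tangible with a discrete toy setting, in which expectations are finite sums instead of integrals. All numbers below are hypothetical: four candidate trajectories with two-dimensional feature counts, one weight vector and three probability assignments, two of which share the same expected feature count.

```python
import math


def expectation(probs, values):
    """Expected feature count under a probability mass function."""
    dim = len(values[0])
    return [sum(p * v[j] for p, v in zip(probs, values)) for j in range(dim)]


def expected_cost(probs, values, theta):
    """Expected cost with J = -theta^T mu for every trajectory."""
    return sum(p * -sum(t * m for t, m in zip(theta, v)) for p, v in zip(probs, values))


# Hypothetical feature counts of four trajectories and a hypothetical true weight vector.
feature_counts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta_true = [2.0, 1.0]

p_true = [0.25, 0.25, 0.25, 0.25]
p_matched = [0.4, 0.1, 0.1, 0.4]  # different PMF, same expected feature count
p_other = [0.1, 0.2, 0.3, 0.4]    # different expected feature count


def cost_gap_and_bound(p_a, p_b):
    """Left-hand side of (6.11) and right-hand side of (6.13)."""
    gap = abs(expected_cost(p_a, feature_counts, theta_true)
              - expected_cost(p_b, feature_counts, theta_true))
    mu_a = expectation(p_a, feature_counts)
    mu_b = expectation(p_b, feature_counts)
    theta_norm = math.sqrt(sum(t * t for t in theta_true))
    mu_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_a, mu_b)))
    return gap, theta_norm * mu_dist


print(cost_gap_and_bound(p_matched, p_true))  # both values vanish
print(cost_gap_and_bound(p_other, p_true))    # gap stays below the bound
```

Matching feature expectations is thus sufficient for matching expected costs, even though the two distributions themselves differ.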


Since the inverse dynamic game problem defined in Definition 6.4 demands the fulfillment of
(6.7), by the results of Lemma 6.1 and using Assumption 6.1 we require


$$ \mathrm{E}_{p(\zeta | \boldsymbol{\theta}_{1:N})}\left\{ \boldsymbol{\mu}_i(\zeta) \right\} = \tilde{\boldsymbol{\mu}}_i \tag{6.14} $$


for each player $i \in \mathcal{P}$. Moreover, for a density function,


$$ \int_{\forall \zeta} p(\zeta \,|\, \boldsymbol{\theta}_{1:N}) \, \mathrm{d}\zeta = 1 \tag{6.15} $$


must apply. Since the conditions (6.14) and (6.15) do not lead to a unique solution for the
probability density function, the principle of maximum entropy is applied. For a continuous
random variable, the (differential) entropy corresponding to a probability density function is
given by [CT06, Section 8.1]


$$ h\left(p(\zeta \,|\, \boldsymbol{\theta}_{1:N})\right) = -\int_{\forall \zeta} p(\zeta \,|\, \boldsymbol{\theta}_{1:N}) \ln\left(p(\zeta \,|\, \boldsymbol{\theta}_{1:N})\right) \mathrm{d}\zeta. \tag{6.16} $$


In order to determine a probability density function $p(\zeta \,|\, \boldsymbol{\theta}_{1:N})$
which only takes the information of (6.14) and (6.15) into consideration, the differential
entropy (6.16) is maximized with the requirements (6.14) and (6.15) as optimization
constraints. The density function which leads to maximum entropy in dynamic games is presented
in the following lemma.
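
Lemma 6.2 (Maximum Entropy Probability Distribution in Inverse Dynamic Games)
The maximum entropy distribution under the constraints defined by (6.14) and (6.15) is
given by


$$
p(\zeta \,|\, \boldsymbol{\theta}_{1:N})
= \frac{\exp\left( \sum_{i=1}^{N} \boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i(\zeta) \right)}{\int_{\forall \zeta} \exp\left( \sum_{i=1}^{N} \boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i(\zeta) \right) \mathrm{d}\zeta}
= \frac{\exp\left( -\sum_{i=1}^{N} J_i(\zeta, \boldsymbol{\theta}_i) \right)}{\int_{\forall \zeta} \exp\left( -\sum_{i=1}^{N} J_i(\zeta, \boldsymbol{\theta}_i) \right) \mathrm{d}\zeta},
\tag{6.17}
$$


where the alternative representation given in the last equation follows from (6.6).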


Proof: 

A calculus-based approach is followed as suggested in [CT06, Section 12.1]. To maximize 
the differential entropy (6.16) under the constraints given by (6.14) and (6.15), we introduce 
Lagrange multipliers y € Rand 0; € RM”, Vi € P, and set up the objective function 


AEON), Y, Orn) = 


-f PIONI DA +y | Peat ee 
vč vč 


+65 ( S PED mO- i) +. (6.18) 


+ On (S Plein) ay O- ty) 


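In this way, the expression


$$\begin{aligned}
\frac{\mathrm{d}\Lambda}{\mathrm{d}p(\zeta \mid \theta_{1:N})} ={}& -\frac{\mathrm{d}}{\mathrm{d}p} \int_{\forall \zeta} p \ln(p)\, \mathrm{d}\zeta + \gamma\, \frac{\mathrm{d}}{\mathrm{d}p} \left( \int_{\forall \zeta} p\, \mathrm{d}\zeta - 1 \right) + \sum_{i=1}^{N} \theta_i^\top \frac{\mathrm{d}}{\mathrm{d}p} \left( \int_{\forall \zeta} p\, \mu_i(\zeta)\, \mathrm{d}\zeta - \tilde{\mu}_i \right) \\
={}& \int_{\forall \zeta} \left( -\ln\big(p(\zeta \mid \theta_{1:N})\big) - 1 + \gamma + \sum_{i=1}^{N} \theta_i^\top \mu_i(\zeta) \right) \mathrm{d}\zeta \\
={}& 0
\end{aligned} \tag{6.19}$$

gives a necessary condition for the sought probability density function. By inspecting (6.19)
we see that this condition is fulfilled if


$$-\ln\big(p(\zeta \mid \theta_{1:N})\big) - 1 + \gamma + \sum_{i=1}^{N} \theta_i^\top \mu_i(\zeta) = 0. \tag{6.20}$$


By reformulating (6.20), we obtain the probability density function of a trajectory $\zeta$, i.e.


$$p(\zeta \mid \theta_{1:N}) = \exp(-1 + \gamma)\, \exp\left( \sum_{i=1}^{N} \theta_i^\top \mu_i(\zeta) \right). \tag{6.21}$$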

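Using (6.21), (6.15) is rewritten as


$$\begin{aligned}
1 &= \int_{\forall \zeta} p(\zeta \mid \theta_{1:N})\, \mathrm{d}\zeta = \exp(-1 + \gamma) \int_{\forall \zeta} \exp\left( \sum_{i=1}^{N} \theta_i^\top \mu_i(\zeta) \right) \mathrm{d}\zeta \\
\Leftrightarrow \quad \exp(-1 + \gamma) &= \frac{1}{\int_{\forall \zeta} \exp\left( \sum_{i=1}^{N} \theta_i^\top \mu_i(\zeta) \right) \mathrm{d}\zeta}.
\end{aligned} \tag{6.22}$$


Inserting (6.22) in (6.21) leads to the probability density function (6.17). The entropy is
maximized since


$$\frac{\mathrm{d}^2 \Lambda}{\mathrm{d}p(\zeta \mid \theta_{1:N})^2} = -\int_{\forall \zeta} \frac{1}{p(\zeta \mid \theta_{1:N})}\, \mathrm{d}\zeta < 0 \tag{6.23}$$


for all $p(\zeta \mid \theta_{1:N}) \neq 0$.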


In order to obtain an estimate of the cost function parameters $\theta_i$, $i \in \mathcal{P}$, it may appear
suitable to maximize the probability density function (6.17), analogously to similar single-player
IRL methods [ZMBD08, LK12]. However, given the dependence of (6.17) on the cost function
parameters of all players, it is not possible to solve for a particular $\theta_i$. Nevertheless, if $\zeta$
corresponds to a Pareto efficient solution according to Definition 3.9, then (6.17) can be used
to identify corresponding parameters $\theta_i$ which explain the observations. This approach is
presented in Appendix D.


The following sections present approaches to identify cost function parameters which explain
observed Nash equilibrium trajectories.

6.4 Open-Loop Case


In this section, we shall consider inverse dynamic games where each player applies an open- 
loop strategy (cf. Definition A.4) and an open-loop Nash equilibrium (OLNE) arises from their 
interaction. 


A suitable probability density function $p(\zeta)$ is sought which allows for the estimation of cost
function parameters.


6.4.1 Probability Density Function 


The non-cooperative character of the dynamic game implies that each player only considers
his own cost function and strives for its minimization by means of the selected open-loop
strategy. From Theorem A.1 we see that the open-loop Nash equilibrium involves the solution
of a set of differential equations which includes derivatives of the system dynamics and the
features (which constitute the Hamiltonian) with respect to the system state $x^{(k)}$ and player
$i$'s controls $u_i^{(k)}$. The other players' controls do not depend on either of these, and therefore,
they do not have any influence on player $i$'s actions.


Consequently, the following probability density function


$$p(\zeta \mid \theta_i) = \frac{\exp\big(J_i(\zeta)\big)}{\int_{\forall \zeta} \exp\big(J_i(\zeta)\big)\, \mathrm{d}\zeta} = \frac{\exp\big(\theta_i^\top \mu_i(\zeta)\big)}{\int_{\forall \zeta} \exp\big(\theta_i^\top \mu_i(\zeta)\big)\, \mathrm{d}\zeta} \tag{6.24}$$


is defined, which represents the probability (density) of a particular trajectory from the point
of view of player $i$. This density implies that the probability of a particular trajectory is
inversely proportional to the costs generated by player $i$'s own individual cost function $J_i$
defined by player $i$'s cost function parameter set $\theta_i$. This simplifies the probability density
function $p(\zeta \mid \theta_{1:N})$ in such a way that $N$ probability density functions $p(\zeta \mid \theta_i)$ which depend
on each player's cost function parameters $\theta_i$ are considered instead of one single probability
density function which depends on all parameters.
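
As a purely illustrative sketch, the structure of (6.24) can be mimicked on a finite set of candidate trajectories, where the integral in the denominator becomes a sum. The function name `trajectory_density`, the candidate feature counts and the parameter values are assumptions made for this example and do not stem from the original text.

```python
import numpy as np

def trajectory_density(theta, feature_counts):
    """Discrete analogue of (6.24) over a finite candidate set.

    feature_counts has shape (C, M): row c is the feature count of candidate c.
    The integral over all trajectories is replaced by a sum over the C rows.
    """
    scores = feature_counts @ theta      # J_i = theta^T mu_i for every candidate
    scores = scores - scores.max()       # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Hypothetical feature counts of three candidate trajectories (illustrative values).
mu = np.array([[-1.0, -0.5],
               [-2.0, -0.1],
               [-4.0, -3.0]])
theta_i = np.array([1.0, 2.0])
p = trajectory_density(theta_i, mu)
```

Candidates with a larger value of $\theta_i^\top \mu_i$ receive a larger probability, and the probabilities sum to one by construction.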


Considering a possible total number of $n_t$ demonstrations, the following likelihood function
is defined based on the introduced probability density function.

Definition 6.5 (Likelihood Function)
Let a set of $n_t$ trajectories denoted by $\mathcal{D} = \{\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_t}\}$ be given. Then the likelihood of the
data given a parameter vector $\theta_i$ is defined as


$$\mathcal{L}\{\theta_i \mid \mathcal{D}\} = \prod_{l=1}^{n_t} p(\tilde{\zeta}_l \mid \theta_i), \tag{6.25}$$


where $p(\tilde{\zeta}_l \mid \theta_i)$ is obtained by evaluating (6.24) at $\tilde{\zeta}_l$, $l \in \{1, \dots, n_t\}$.


The likelihood describes the probability density of the trajectories when the parameters are
set. Moreover, it is a function of $\theta_i$. With this function, the foundation for a maximum
likelihood estimation (MLE) of the cost function parameters is given. In order to show that
maximizing the likelihood leads to an unbiased estimation of the cost function parameters,
the following assumption adapts Assumption 6.1 (and (6.9)) to probability density functions
depending only on the parameters $\theta_i$ of one player $i \in \mathcal{P}$ as defined in (6.24).


Assumption 6.2 (Expectation and Mean Equivalence) 


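The mean of the feature count of the $n_t$ observed trajectories is equal to the expectation of
the feature count of the trajectories resulting from the probability density function with
original parameters $\theta_i^*$, i.e.


$$\mathrm{E}_{p(\zeta \mid \theta_i^*)}\{\mu_j(\zeta)\} = \frac{1}{n_t} \sum_{l=1}^{n_t} \mu_j(\tilde{\zeta}_l), \quad \forall i, j \in \mathcal{P}. \tag{6.26}$$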


6.4.2 Cost Function Estimation and Unbiasedness Results 


Before presenting the unbiasedness of the MLE as the main result for inverse non-cooperative
dynamic games of this chapter, an alternative definition of the cost functions is introduced
which will be convenient for the proof of the main theorem.

Definition 6.6 (Extended Features, Feature Count and Parameter Vector)


Let $\bar{\phi}$ denote an extended feature vector which includes all features $\phi_{i,(j)}$, $i \in \mathcal{P}$, $j \in \{1, \dots, M_i\}$,
of all $N$ players such that $\bar{\phi}_{(r)} \neq \bar{\phi}_{(s)}$ for all $r, s \in \{1, \dots, \dim(\bar{\phi})\}$ and $r \neq s$.
In other words, the extended feature vector $\bar{\phi}$ consists of the feature vectors $\phi_i$ of all players
such that no feature is included more than once and all features are linearly independent of
each other. The extended feature count $\bar{\mu}(\zeta)$ is defined analogously according to Definition
6.3. Furthermore, let the extended parameter vector $\bar{\theta}_i$ be defined such that


$$J_i(\zeta) = \theta_i^\top \mu_i(\zeta) = \bar{\theta}_i^\top \bar{\mu}(\zeta), \quad i \in \mathcal{P}. \tag{6.27}$$


Remark 6.1: 


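For (6.27) to hold, $\bar{\theta}_i$ has to include zeros in the positions corresponding to the elements of $\bar{\phi}$
representing features which were not in $\phi_i$ previously.


Remark 6.2:
Assumption 6.2 leads to


$$\mathrm{E}_{p(\zeta \mid \theta_i^*)}\{\bar{\mu}(\zeta)\} = \frac{1}{n_t} \sum_{l=1}^{n_t} \bar{\mu}(\tilde{\zeta}_l), \quad \forall i \in \mathcal{P}, \tag{6.28}$$

for the extended feature count $\bar{\mu}(\zeta)$.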


The following theorem presents the method for estimating cost function parameters from 
open-loop Nash equilibrium trajectories and states the unbiasedness of the estimation. 


Theorem 6.1 (Unbiasedness of the Estimation) 


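Let a set of trajectories $\mathcal{D} = \{\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_t}\}$ for which Assumption 6.2 is fulfilled be given.
Then, the MLE with respect to the observed trajectories, i.e.


$$\hat{\theta}_i = \arg\max_{\theta_i} \ln \mathcal{L}\{\theta_i \mid \mathcal{D}\}, \tag{6.29}$$


where $\mathcal{L}\{\theta_i \mid \mathcal{D}\}$ is obtained by evaluating the likelihood function of Definition 6.5 at $\tilde{\zeta}_l$,
$l \in \{1, \dots, n_t\}$, leads to parameters $\hat{\theta}_i$ such that $p(\zeta \mid \hat{\theta}_i)$ results in an expectation of the
cost function values $J_j(\zeta \mid \theta_j^*)$, $\forall j \in \mathcal{P}$, which is equal to the one corresponding to $p(\zeta \mid \theta_i^*)$,
i.e.

$$\mathrm{E}_{p(\zeta \mid \hat{\theta}_i)}\{J_j(\zeta \mid \theta_j^*)\} = \mathrm{E}_{p(\zeta \mid \theta_i^*)}\{J_j(\zeta \mid \theta_j^*)\} \tag{6.30}$$

holds for all $i, j \in \mathcal{P}$.

Proof:
Using the extended parameter vector $\bar{\theta}_j$ (cf. Definition 6.6), (6.30) can be rewritten as


$$\mathrm{E}_{p(\zeta \mid \hat{\theta}_i)}\{\bar{\theta}_j^{*\top} \bar{\mu}(\zeta)\} = \mathrm{E}_{p(\zeta \mid \theta_i^*)}\{\bar{\theta}_j^{*\top} \bar{\mu}(\zeta)\} \tag{6.31}$$

for all $i, j \in \mathcal{P}$. Therefore, (6.31) shall be proved in the following.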


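The maximization of the log-likelihood function (6.29) implies


$$0 = \left. \frac{\partial}{\partial \bar{\theta}_i} \ln \prod_{l=1}^{n_t} p(\tilde{\zeta}_l \mid \bar{\theta}_i) \right|_{\theta_i = \hat{\theta}_i} \tag{6.32}$$

$$= \left. \sum_{l=1}^{n_t} \frac{\partial}{\partial \bar{\theta}_i} \left( -\ln \int_{\forall \zeta} \exp\big(\bar{\theta}_i^\top \bar{\mu}(\zeta)\big)\, \mathrm{d}\zeta + \bar{\theta}_i^\top \bar{\mu}(\tilde{\zeta}_l) \right) \right|_{\theta_i = \hat{\theta}_i} \tag{6.33}$$

$$= \left. \sum_{l=1}^{n_t} \left( -\frac{\frac{\partial}{\partial \bar{\theta}_i} \int_{\forall \zeta} \exp\big(\bar{\theta}_i^\top \bar{\mu}(\zeta)\big)\, \mathrm{d}\zeta}{\int_{\forall \zeta} \exp\big(\bar{\theta}_i^\top \bar{\mu}(\zeta)\big)\, \mathrm{d}\zeta} + \bar{\mu}(\tilde{\zeta}_l) \right) \right|_{\theta_i = \hat{\theta}_i}. \tag{6.34}$$


Since the integrals in the numerator and the denominator in (6.34) are independent of each
other, (6.34) can be rewritten as


$$0 = \left. \sum_{l=1}^{n_t} \left( -\int_{\forall \zeta} \frac{\exp\big(\bar{\theta}_i^\top \bar{\mu}(\zeta)\big)}{\int_{\forall \zeta'} \exp\big(\bar{\theta}_i^\top \bar{\mu}(\zeta')\big)\, \mathrm{d}\zeta'}\, \bar{\mu}(\zeta)\, \mathrm{d}\zeta + \bar{\mu}(\tilde{\zeta}_l) \right) \right|_{\theta_i = \hat{\theta}_i}. \tag{6.35}$$

Using the defined probability density function (6.24), we obtain

$$0 = \left. \sum_{l=1}^{n_t} \left( -\int_{\forall \zeta} p(\zeta \mid \theta_i)\, \bar{\mu}(\zeta)\, \mathrm{d}\zeta + \bar{\mu}(\tilde{\zeta}_l) \right) \right|_{\theta_i = \hat{\theta}_i} = \sum_{l=1}^{n_t} \left( -\mathrm{E}_{p(\zeta \mid \hat{\theta}_i)}\{\bar{\mu}(\zeta)\} + \bar{\mu}(\tilde{\zeta}_l) \right). \tag{6.36}$$


By rewriting (6.36) and considering Assumption 6.2 and Remark 6.2,


$$\mathrm{E}_{p(\zeta \mid \hat{\theta}_i)}\{\bar{\mu}(\zeta)\} = \frac{1}{n_t} \sum_{l=1}^{n_t} \bar{\mu}(\tilde{\zeta}_l) = \mathrm{E}_{p(\zeta \mid \theta_i^*)}\{\bar{\mu}(\zeta)\} \tag{6.37}$$

results. Therefore, the expectations of the feature count $\bar{\mu}$ are equal for both probability density
functions. By applying the results of Lemma 6.1 (which also hold for a probability density
function $p(\zeta \mid \theta_i)$) we conclude that (6.37) leads to (6.31) which is equivalent to (6.30). □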
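
The stationarity condition (6.36) says that the gradient of the log-likelihood is the difference between observed and expected feature counts. The following is a minimal numerical sketch of this moment-matching property on a hypothetical finite candidate set, where the integral is replaced by a sum. The function names, the candidate feature counts and the demonstration indices are assumptions made for illustration; plain gradient ascent stands in for whatever optimizer is actually used.

```python
import numpy as np

def expected_feature_count(theta, feature_counts):
    """Expected feature count under p(candidate | theta) ~ exp(theta^T mu)."""
    scores = feature_counts @ theta
    weights = np.exp(scores - scores.max())   # stabilised softmax weights
    p = weights / weights.sum()
    return p @ feature_counts

def fit_mle(feature_counts, demo_indices, step=1.0, iters=500):
    """Gradient ascent on the mean log-likelihood over a finite candidate set."""
    empirical = feature_counts[demo_indices].mean(axis=0)
    theta = np.zeros(feature_counts.shape[1])
    for _ in range(iters):
        # Gradient per demonstration: observed mean minus expected feature count.
        theta = theta + step * (empirical - expected_feature_count(theta, feature_counts))
    return theta, empirical

# Hypothetical candidates (rows are feature counts) and observed candidate indices.
candidates = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
demos = np.array([0, 1, 1, 2, 3, 3, 3])
theta_hat, empirical_mean = fit_mle(candidates, demos)
model_mean = expected_feature_count(theta_hat, candidates)
```

At the maximizer, `model_mean` coincides with `empirical_mean`, which is the discrete counterpart of (6.37).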


The results of Theorem 6.1 guarantee (6.30), which at first glance differs from the requirement
(6.7) posed in the inverse dynamic game problem in Definition 6.4. However, for inverse open-loop
dynamic games, it was proposed to consider $N$ probability density functions $p(\zeta \mid \theta_i^*)$
instead of a single one given by $p(\zeta \mid \theta_{1:N}^*)$. Therefore, instead of the equivalence of expected
costs with respect to this initially assumed probability density function $p(\zeta \mid \theta_{1:N}^*)$, we obtain
the equivalence of expected costs for all players $j \in \mathcal{P}$ with respect to each of the $N$ probability
density functions $p(\zeta \mid \theta_i^*)$ as stated in (6.30). Consequently, the estimated parameters
$\hat{\theta}_i$ solve the inverse dynamic game problem for an open-loop information structure.


Remark 6.3: 


Solving the optimization problem (6.29) demands the possibility of evaluating the likelihood
function $\mathcal{L}\{\theta_i \mid \mathcal{D}\}$ and therefore the probability density function (6.24) at the trajectories $\tilde{\zeta}_l$.
The denominator in (6.24) includes an integral over all trajectories $\zeta$ which are feasible with
respect to the system dynamics and an initial state. Calculating this integral is intractable given
the continuous-valued state and action spaces. Therefore, approximations are usually applied.
This will be tackled in Section 6.6.


6.5 Feedback Case 


In this section, solutions for inverse dynamic games with the feedback Nash equilibrium
(FNE) as a solution concept are presented. Therefore, the MPS and feedback information
structures according to Definition A.3 are considered. The resulting strategies are given
by³⁹

$$u_i^{(k)} = \gamma_i^{(k)}\big(x^{(k)}\big). \tag{6.38}$$


The following assumption is needed for the results of this section.


Assumption 6.3 (Control Laws)
The Nash equilibrium control laws $\gamma_i^{(k)*}(x^{(k)})$, $k \in \mathcal{K}$, are known for all players $i \in \mathcal{P}$.


39 According to [BO99, p. 278], the feedback Nash equilibrium solution under the MPS information pattern solely
depends on $x^{(k)}$ at the time step $k$. The dependency on $x^{(1)}$ is given only for $k = 1$. Therefore, we have feedback
strategies as in Definition A.5 for both MPS and FB information structures.

For the case of a finite-horizon dynamic game, i.e. $k_E \in \mathbb{N}$, Assumption 6.3 demands the
knowledge of the exact (time-dependent) function $\gamma_i^{(k)*}(x^{(k)})$. This case is analogous to Assumption
4.3 for inverse feedback differential games. In case of an infinite-horizon ($k_E \to \infty$)
dynamic game, Assumption 6.3 implies that the time-independent functional relationship of
$\gamma_i^*$ to $x$ is known.


Remark 6.4: 


Assumption 6.3 is rather restrictive for general nonlinear feedback Nash equilibria. However, 
not only the estimation of the control law is non-trivial, but also the calculation of the equilib- 
ria themselves which implies the solution of coupled partial differential equations (see Theorem 
3.2) or coupled Bellman equations (see Theorem A.2). On the other hand, Assumption 6.3 is 
not restrictive for infinite-horizon linear-quadratic dynamic games, since the Nash equilibrium 
controls are given by 

y (x) = Kix, (6.39) 


with $K_i^* \in \mathbb{R}^{m_i \times n}$ [Eng05, Section 8.3]. As mentioned in Section 5.4.2, the estimation of $K_i^*$ can
easily be performed by means of a least-squares approach.
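
A minimal sketch of such a least-squares fit is given below. It assumes a linear law $u_i = K_i x$ and uses ordinary least squares on stacked observations; it is not necessarily identical to the procedure of Section 5.4.2. The function name, the synthetic data and the noise level are assumptions made for illustration.

```python
import numpy as np

def estimate_feedback_gain(states, controls):
    """Least-squares fit of a constant gain K with controls[k] ~ K @ states[k].

    states:   array of shape (T, n), one observed state per row
    controls: array of shape (T, m), one observed control of player i per row
    """
    # Solve min_K || states @ K.T - controls ||_F in the least-squares sense.
    gain_transposed, *_ = np.linalg.lstsq(states, controls, rcond=None)
    return gain_transposed.T

# Synthetic, purely illustrative data: a hypothetical gain plus small noise.
rng = np.random.default_rng(0)
K_true = np.array([[-1.5, -0.8]])
X = rng.normal(size=(200, 2))
U = X @ K_true.T + 0.01 * rng.normal(size=(200, 1))
K_hat = estimate_feedback_gain(X, U)
```

With sufficiently exciting state data, `K_hat` recovers the gain up to the noise level.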


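If Assumption 6.3 holds, the control laws of the players $j \in \mathcal{P}$, $j \neq i$, can replace $u_j^{(k)}$ in (6.1),
leading to system dynamics from player $i$'s perspective defined as


$$x^{(k+1)} = f^{(k)}\big(x^{(k)}, u_i^{(k)}, \gamma_{\neg i}^{(k)*}(x^{(k)})\big) = f_i^{(k)}\big(x^{(k)}, u_i^{(k)}\big). \tag{6.40}$$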


In this way, it is possible for player $i$ to represent the system dynamics as a function of the
system state $x$ and his own control variable $u_i$. The effect of the other players' controls
is considered due to the implied knowledge of the control laws and the system state in
every time step. Analogously, the features $\phi_i$ of player $i$'s cost function can be rewritten as a
function of the state $x$ and the control variables $u_i$, i.e.


$$\begin{aligned}
\phi_i^{(k)} &= \phi_i\big(x^{(k)}, u_i^{(k)}, u_{\neg i}^{(k)}\big) \\
&= \phi_i\big(x^{(k)}, u_i^{(k)}, \gamma_{\neg i}^{(k)*}(x^{(k)})\big) \\
&= \phi_i\big(x^{(k)}, u_i^{(k)}\big),
\end{aligned} \tag{6.41}$$


where the same vector $\phi_i$ is used with some mathematical freedom in favor of a simplified
presentation. Based on the system dynamics from player $i$'s perspective (6.40) and the rewritten
features (6.41), the following theorem is presented which describes the method for an
unbiased maximum likelihood estimation of cost function parameters in an inverse feedback
Nash dynamic game.

Theorem 6.2 (Unbiasedness of the Estimation)


Let a set of trajectories $\mathcal{D} = \{\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_t}\}$ be given such that Assumption 6.2 is fulfilled.
Furthermore, let Assumption 6.3 hold such that the feedback Nash control laws $\gamma_i^{(k)*}$ are
known for all $i \in \mathcal{P}$. Then, the MLE with respect to the observed trajectories, i.e.


$$\hat{\theta}_i = \arg\max_{\theta_i} \ln \mathcal{L}\{\theta_i \mid \mathcal{D}\}, \tag{6.42}$$


where $\mathcal{L}\{\theta_i \mid \mathcal{D}\}$ is obtained by evaluating the likelihood function of Definition 6.5 at $\tilde{\zeta}_l$,
$l \in \{1, \dots, n_t\}$ and with respect to the system dynamics (6.40), leads to parameters $\hat{\theta}_i$ such
that

$$\mathrm{E}_{p(\zeta \mid \hat{\theta}_i)}\{J_j(\zeta \mid \theta_j^*)\} = \mathrm{E}_{p(\zeta \mid \theta_i^*)}\{J_j(\zeta \mid \theta_j^*)\} \tag{6.43}$$

holds for all $i, j \in \mathcal{P}$ (cf. Theorem 6.1).


Proof:
The cost functions $J_i$, $i \in \mathcal{P}$, can be rewritten using the modified features (6.41). Afterwards,
the theorem can be proved analogously to Theorem 6.1. □


6.6 Practical Aspects 


The results of the previous sections provide the theoretical foundation for the application of
MaxEnt IRL for the solution of inverse dynamic game problems. The core of the method is
the MLE based on the probability density functions $p(\zeta \mid \theta_i)$. The focus of this section is laid
on the computation of the MLEs which yield cost function parameters $\hat{\theta}_i$ explaining observed
results of a dynamic game. This poses the practical challenge of evaluating the probability
density function (6.24) and with that result, the likelihood function (6.25). This is the main
objective approached in this section.


6.6.1 Approximation of the Probability Density Function 


The integral in the denominator of (6.24) is computationally intractable and therefore, an 
approximation is necessary. This may be achieved by replacing the integral with a sum over 
several trajectory samples which have to be generated from a previously defined probabil- 
ity distribution [KPRS13, MHB16] or determined in each iteration from a forward optimal 
control or dynamic game solution with current cost function parameter candidates [AB11]. 
Which sampled trajectories are chosen has a great impact on the estimation of cost function
parameters (cf. [AB11]). In order to avoid the problem of choosing adequate samples, in this
thesis the integral, and with it the probability density functions, are approximated locally.
The following procedure is inspired by the approach proposed in [LK12] for a single-player
case. Nonetheless, some modifications are introduced and will be explained when suitable.


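Consider any player $i \in \mathcal{P}$. Given an observed trajectory $\tilde{\zeta}_l$, $l \in \{1, 2, \dots, n_t\}$, and consequently,
the control trajectories $\tilde{u}_{\neg i,l}$ of all other players, we can formulate the costs $J_i(\tilde{\zeta}_l, \theta_i)$
of player $i$ generated by $\tilde{\zeta}_l$ such that only variations of his own control trajectory $u_i$ are
taken into account, i.e. the costs are formulated as $J_i(u_i, \tilde{u}_{\neg i,l}, \theta_i)$. Local approximations of
the observed trajectory $\tilde{\zeta}_l$ are considered which arise from the aforementioned variations of
$u_i$ while the other players' controls $\tilde{u}_{\neg i,l}$ remain unchanged. Hence, we approximate the
cost function $J_i(u_i, \tilde{u}_{\neg i,l}, \theta_i)$ by means of a second-order Taylor series expansion around the
observed controls $\tilde{u}_{i,l}$ corresponding to the trajectory $\tilde{\zeta}_l$. This results in


$$\begin{aligned}
J_i(u_i, \tilde{u}_{\neg i,l}, \theta_i) \approx{}& J_i(\tilde{u}_{i,l}, \tilde{u}_{\neg i,l}, \theta_i) + (u_i - \tilde{u}_{i,l})^\top \tilde{g}_{i,l}(\theta_i) \\
& + \frac{1}{2} (u_i - \tilde{u}_{i,l})^\top \tilde{G}_{i,l}(\theta_i)\, (u_i - \tilde{u}_{i,l}),
\end{aligned} \tag{6.44}$$

where $\tilde{g}_{i,l}(\theta_i) \in \mathbb{R}^{m_i k_E}$ and $\tilde{G}_{i,l}(\theta_i) \in \mathbb{R}^{m_i k_E \times m_i k_E}$ denote the first and second derivative of
$J_i$ with respect to $u_i$, respectively, i.e.


$$\tilde{g}_{i,l}(\theta_i) := \left. \frac{\mathrm{d}J_i}{\mathrm{d}u_i} \right|_{u_i = \tilde{u}_{i,l}}, \tag{6.45}$$

$$\tilde{G}_{i,l}(\theta_i) := \left. \frac{\mathrm{d}^2 J_i}{\mathrm{d}u_i^2} \right|_{u_i = \tilde{u}_{i,l}}. \tag{6.46}$$


In the following, $\tilde{g}_{i,l}(\theta_i)$ and $\tilde{G}_{i,l}(\theta_i)$ are written as $\tilde{g}_{i,l}$ and $\tilde{G}_{i,l}$, respectively, for brevity.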


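By reformulating (6.24) using the Taylor series based approximation (6.44) of the cost function
and considering that the observed trajectory $\tilde{\zeta}_l$ is (with fixed $\theta_i$) uniquely defined by the
controls $\tilde{u}_{i,l}$ with given $\tilde{u}_{\neg i,l}$ and the initial state $x^{(1)}$, the probability density function can be
evaluated at $\tilde{\zeta}_l$ using the relation


$$p(\tilde{\zeta}_l \mid \theta_i) = \frac{e^{J_i(\tilde{u}_{i,l}, \tilde{u}_{\neg i,l}, x^{(1)}, \theta_i)}}{\int_{-\infty}^{\infty} e^{J_i(u_i, \tilde{u}_{\neg i,l}, x^{(1)}, \theta_i)}\, \mathrm{d}u_i} \approx e^{\frac{1}{2} \tilde{g}_{i,l}^\top \tilde{G}_{i,l}^{-1} \tilde{g}_{i,l}} \det\big(-\tilde{G}_{i,l}\big)^{\frac{1}{2}} (2\pi)^{-\frac{\dim(\tilde{u}_{i,l})}{2}}. \tag{6.47}$$

This leads to the log-likelihood function


$$\ln\big(\mathcal{L}\{\theta_i \mid \mathcal{D}\}\big) = \sum_{l=1}^{n_t} \left( \frac{1}{2} \tilde{g}_{i,l}^\top \tilde{G}_{i,l}^{-1} \tilde{g}_{i,l} + \frac{1}{2} \ln\big(\det(-\tilde{G}_{i,l})\big) - \frac{\dim(\tilde{u}_{i,l})}{2} \ln(2\pi) \right) \tag{6.48}$$


which can be used for the MLEs stated in Theorems 6.1 and 6.2. The detailed calculation steps
are provided in Section B.5 of the Appendix.⁴⁰


Therefore, in order to evaluate (6.24), the first derivative $\tilde{g}_{i,l}$ and the second derivative $\tilde{G}_{i,l}$
are needed. Their calculation is explained in the following.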
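
The following sketch shows how a local (Laplace-type) approximation of this kind can be evaluated numerically once a gradient and a Hessian are available. It assumes the density is proportional to $\exp(J_i)$, so that the Hessian must be negative definite, and it follows the form $\frac{1}{2} g^\top G^{-1} g + \frac{1}{2}\ln\det(-G) - \frac{d}{2}\ln(2\pi)$ as read here from (6.47) and (6.48). Function names and test values are assumptions made for illustration.

```python
import numpy as np

def laplace_log_density(g, G):
    """Log of the local approximation for one observed control trajectory.

    g: gradient of J_i with respect to the stacked controls, shape (d,)
    G: Hessian of J_i with respect to the stacked controls, shape (d, d)
    """
    # Cholesky raises LinAlgError unless -G is positive definite.
    chol = np.linalg.cholesky(-G)
    logdet = 2.0 * np.log(np.diag(chol)).sum()
    quad = 0.5 * g @ np.linalg.solve(G, g)
    return quad + 0.5 * logdet - 0.5 * g.size * np.log(2.0 * np.pi)

def log_likelihood(gradients, hessians):
    """Sum of the per-trajectory terms over all demonstrations."""
    return sum(laplace_log_density(g, G) for g, G in zip(gradients, hessians))

# Check on an exactly quadratic J(u) = -0.5 (u - c)^T H (u - c) (hypothetical values).
H = np.array([[2.0, 0.3], [0.3, 1.0]])
c = np.array([0.5, -1.0])
u_obs = np.array([0.7, -0.6])
g_obs = -H @ (u_obs - c)
G_obs = -H
value = laplace_log_density(g_obs, G_obs)
```

For a quadratic $J$ the approximation is exact, so `value` equals the log-density of a Gaussian with mean `c` and precision matrix `H` evaluated at `u_obs`.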


6.6.2 Evaluation of the Log-Likelihood Function 


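By applying the chain rule, the first and second derivatives of the cost function are given
by


$$\tilde{g}_{i,l} = \left. \nabla_{u_i} J_i + (\nabla_{u_i} x)^\top \nabla_x J_i \right|_{u_i = \tilde{u}_{i,l},\, x = \tilde{x}_l}, \tag{6.49}$$

$$\tilde{G}_{i,l} = \left. \nabla_{u_i u_i} J_i + (\nabla_{u_i} x)^\top \nabla_{xx} J_i \nabla_{u_i} x + \nabla_{u_i u_i} x \times_1 \nabla_x J_i + 2 \nabla_{u_i x} J_i \nabla_{u_i} x \right|_{u_i = \tilde{u}_{i,l},\, x = \tilde{x}_l}, \tag{6.50}$$


where $\nabla_{u_i} J_i$ and $\nabla_x J_i$ denote the partial derivatives of $J_i$ with respect to $u_i$ and $x$, respectively.⁴¹
Likewise, $\nabla_{u_i u_i} J_i$, $\nabla_{xx} J_i$ and $\nabla_{u_i x} J_i$ represent second-order partial derivatives of $J_i$
with respect to $u_i$ and $x$. The partial derivative $\nabla_{u_i} x$ is defined analogously. The term $\nabla_{u_i u_i} x$
is used with some abuse of notation to represent a third-order tensor such that $\times_1$ represents
a 1-mode tensor multiplication [KB09, Section 2.5].⁴²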


In the following, we elaborate on the structure of the partial derivatives which form $\tilde{g}_{i,l}$
and $\tilde{G}_{i,l}$ as given in (6.49) and (6.50), with the partial derivatives with respect to $x$ as an
example. With the assumed structure of the cost function (6.2), we obtain the first-order
partial derivative


Vidi =—[(¥x9;) Own --- (Vrd Oil wen] ER", (6.51) 


where, unless otherwise specified, $\nabla_x \phi_i$ denotes the partial derivative of $\phi_i$ with respect to
$x^{(k)}$. The second-order partial derivatives of the cost function $\nabla_{xx} J_i$, $\nabla_{u_i u_i} J_i$ and $\nabla_{u_i x} J_i$ are
block diagonal matrices since the costs at time step $k$ only depend on the states $x^{(k)}$ and
controls $u^{(k)}$ at time step $k$. Therefore, we obtain


$$\nabla_{xx} J_i = \operatorname{blkdiag}\left( \left. \sum_{j=1}^{M_i} \nabla_{xx} \phi_{i,(j)}\, \theta_{i,(j)} \right|_{x^{(k)} = x^{(1)}}, \; \dots, \; \left. \sum_{j=1}^{M_i} \nabla_{xx} \phi_{i,(j)}\, \theta_{i,(j)} \right|_{x^{(k)} = x^{(k_E)}} \right), \tag{6.52}$$


where blkdiag(·) denotes a block diagonal matrix. In this case, there are $k_E$ blocks of dimension
$n \times n$. The other partial derivatives can be computed analogously to (6.51) and (6.52).


40 Note that (6.44) and (6.48) yield equalities in the case of quadratic cost functions.

41 The last term in (6.50) was neglected in [LK12]. Nevertheless, it can only be neglected if there are no features
which depend on both $x$ and $u_i$, i.e. $\phi_{i,(j)}(x, u_i)$ is equal to either $\phi_{i,(j)}(x)$ or $\phi_{i,(j)}(u_i)$ for all $i \in \mathcal{P}$ and all
$j \in \{1, \dots, M_i\}$.

42 For the 1-mode tensor multiplication we obtain $\nabla_{u_i u_i} x \times_1 \nabla_x J_i = (\nabla_x J_i)^\top \nabla_{u_i u_i} x \in \mathbb{R}^{1 \times m_i k_E \times m_i k_E}$ which
can be represented as a matrix of dimensions $m_i k_E \times m_i k_E$.


The partial derivative $\nabla_{u_i} x$ describes the sensitivity of $x$ with respect to $u_i$ for all time steps
as a consequence of the system dynamics. Since present actions are not influenced by future
actions, the matrix

$$D_i = \left. \nabla_{u_i} x \right|_{u_i = \tilde{u}_{i,l},\, x = \tilde{x}_l} \tag{6.53}$$


is defined, where $D_i$ is a block upper triangular matrix. The blocks within $D_i$ are given by


V cox) for k2 = kı +1 
pře% _ i Kek (kg-1,kı) (6.54) 
(urn) peto]  , for ke > ky +1 - 
k=k2-1 
0, else. 


The blocks $D_i^{(k_2, k_1)}$, $k_1, k_2 \in \mathcal{K}$, have the dimension $n \times m_i$ and represent the influence of
player $i$'s control at time step $k_2$ on the states at time step $k_1$. These partial derivatives can
be interpreted as part of the numerical solution of the initial value problem which approximates
the next state. The matrix $D_i$ employs the partial derivatives with respect to $u_i$ in
each time step for the whole corresponding time interval between two time steps. Contrary
to this approach, a modification of the matrix $D_i$ is proposed here in order to improve the
approximation. Inspired by the trapezoid method for solving initial value problems [Epp13,
Section 6.5], the effect of $u_i^{(k_2)}$ at $k_2$ on $x^{(k_1)}$ is approximated by means of


> (kz2,k 1 
D ash) = 5 (Voy? + Vater) 


= ; (0 + pft) i 


$ 


(6.55) 


The modified matrix D;, which is built with the blocks pe analogously to D with (6.54), 
(ki) 


; on the interval of xk) until x(t) and 


43 


takes into account the effect of the control value u 
yields a better approximation of the system dynamics. 


43 This modification was applied in experimental work presented in [IEFH18].

Contrary to the aforementioned partial derivatives, the term $\nabla_{u_i u_i} x$ is a third-order tensor
and does not exhibit a convenient structure for its computation. Therefore, following the
recommendations in [LK12], this term is neglected in favor of more efficient calculations.⁴⁴


6.6.3 Algorithms 


The results presented in the previous sections are condensed in two algorithms for the solution
of inverse dynamic games by means of MaxEnt IRL. The algorithms summarize the
procedure for cost function identification in a dynamic game when an open-loop information
structure or a feedback information structure (and the corresponding Nash equilibria) is
given. The following Algorithm 3 corresponds to the open-loop case.


Algorithm 3 IRL Method in Open-Loop Dynamic Games for Player i.


Input: Observed trajectory set $\mathcal{D}$, dynamics $f$, basis functions $\phi_i$.
Output: Computed player $i$ cost function parameters $\hat{\theta}_i$.
1: Determine the derivatives of the features $\nabla_x \phi_i$, $\nabla_{u_i} \phi_i$, $\nabla_{xx} \phi_i$ and $\nabla_{u_i u_i} \phi_i$.
2: Determine the matrix $D_i$ with (6.55).
3: Determine the first and second derivatives $\tilde{g}_{i,l}$ and $\tilde{G}_{i,l}$ evaluated at the trajectories $\tilde{\zeta}_l$
by means of (6.49) and (6.50), respectively.
4: Calculate the MLE according to (6.29) using the log-likelihood function (6.48).
5: return $\hat{\theta}_i$.


The next Algorithm 4 gives the necessary steps for solving inverse dynamic games with a 
feedback information structure based on MaxEnt IRL. 


Algorithm 4 IRL Method in Feedback Dynamic Games for Player i.


Input: Observed trajectory set $\mathcal{D}$, dynamics $f$, basis functions $\phi_i$.
Output: Computed player $i$ cost function parameters $\hat{\theta}_i$.
1: Determine the system dynamics with respect to player $i$ by means of (6.40) and the features
according to (6.41).
2: Determine the derivatives of the features $\nabla_x \phi_i$, $\nabla_{u_i} \phi_i$, $\nabla_{xx} \phi_i$ and $\nabla_{u_i u_i} \phi_i$.
3: Determine the matrix $D_i$ with (6.55).
4: Determine the first and second derivatives $\tilde{g}_{i,l}$ and $\tilde{G}_{i,l}$ evaluated at the trajectories $\tilde{\zeta}_l$
by means of (6.49) and (6.50), respectively.
5: Calculate the MLE according to (6.42) using the log-likelihood function (6.48).
6: return $\hat{\theta}_i$.


44 Neglecting this term does not have any effect for most problems. For example, this term is always zero for the
broad class of nonlinear control-affine systems (3.11).

Remark 6.5:


Step 1 and Step 2 of Algorithms 3 and 4, respectively, can also be calculated prior to the identification
procedure since they are independent of the observed data.


Remark 6.6: 


The methods shown in this chapter are formulated for a finite-horizon problem, i.e. $k_E \in \mathbb{N}_{>0}$ in
(6.2). However, all results can still be applied if the assumed underlying LQ dynamic game has
an infinite horizon $k_E \to \infty$. The presented method solely requires the availability of observed
state trajectories $x \in \mathbb{R}^{n K_t}$ and $u_i \in \mathbb{R}^{m_i K_t}$ where $K_t < \infty$ (cf. Definition 6.1). For adequate
results, $[0, K_t]$ should be a sufficiently representative interval of the complete time span $[0, \infty)$.


6.7 Application to Inverse LQ Dynamic Games 


This section presents an exemplary application of IRL for solving inverse LQ dynamic games 
in order to illustrate the procedures presented in Algorithms 3 and 4. In the following, both 
inverse open-loop dynamic games and inverse feedback dynamic games are examined. 


6.7.1 Open-Loop 


Consider $N$-player LQ dynamic games according to Definition A.7. Therefore, each player
applies his controls to a system described by the difference equation


$$x^{(k+1)} = A_D^{(k)} x^{(k)} + \sum_{j=1}^{N} B_{D,j}^{(k)} u_j^{(k)}. \tag{6.56}$$


Furthermore, each player $i \in \mathcal{P}$ selects an open-loop strategy $\gamma_i^{(k)} = u_i^{(k)}$ (cf. Definition A.4)
based on a quadratic cost function of the form


iSe Nr 
Ji = = > ((« ) QO;x + (u! ) Riu; ) (6.57) 
k=1 

where $Q_i$ and $R_{ii} < 0$ are symmetric matrices.⁴⁵ The cost function (6.57) does not include
the terms which penalize the controls $u_j^{(k)}$, $j \neq i$, of all other players (cf. (A.16)). This is due to
the fact that these controls do not have any influence on the solution of open-loop dynamic
games and therefore can be neglected. This follows e.g. from the necessary conditions for
Nash equilibria given in Theorem A.1.


45 The negative sign is considered in this chapter according to (6.2) and thus $R_{ii}$ is negative definite instead of
positive definite to ensure a meaningful problem.

In order to apply the results of the previous sections to linear-quadratic dynamic games, it
is necessary to reformulate quadratic objective functions such that the structure in (6.2) is
obtained. Furthermore, the partial derivatives of the states with respect to the controls have
a particular structure in the case of linear system dynamics. These aspects will be examined
and presented in the following.


Features in LQ Open-Loop Dynamic Games 


The features in the vector $\phi_i$ which correspond to the $\frac{1}{2}(n^2 + n)$ non-redundant elements of
the matrix $Q_i$ are given by


f 1 
gi = 2x9, RM ce=1,...,n,r=1,...,c. (6.58) 


i,rc 2 
Similarly, for the $\frac{1}{2}(m_i^2 + m_i)$ parameters of the symmetric matrix $R_{ii}$, we obtain the features


si 1 
pri, = = Sie CHL. mar aye. 0: (6.59) 


For $r = c$, the parameters which are multiplied with $\phi_{i,rr}^{Q_i}$ and $\phi_{i,rr}^{R_{ii}}$ correspond to the $r$-th
diagonal entry of the matrix $Q_i$ and $R_{ii}$, respectively. For the case where $c \neq r$, these
parameters correspond to two times the off-diagonal (symmetric) entries $Q_{i,rc} = Q_{i,cr}$ and
$R_{ii,rc} = R_{ii,cr}$, respectively.
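
The bookkeeping between a symmetric weighting matrix and its parameter vector can be sketched as follows. The sketch assumes the feature form $-\frac{1}{2} x_r x_c$ that appears in Example 6.1, with the ordering $c = 1, \dots, n$, $r = 1, \dots, c$; the function names and the numerical values are assumptions made for illustration.

```python
import numpy as np

def symmetric_to_parameters(Q):
    """Stack the non-redundant entries of a symmetric matrix into a parameter vector.

    Diagonal entries are taken as they are, off-diagonal entries are doubled,
    in the order c = 1..n, r = 1..c.
    """
    n = Q.shape[0]
    return np.array([Q[r, c] if r == c else 2.0 * Q[r, c]
                     for c in range(n) for r in range(c + 1)])

def quadratic_features(x):
    """Features -0.5 * x_r * x_c in the same order as the parameters (assumed sign)."""
    n = x.size
    return np.array([-0.5 * x[r] * x[c] for c in range(n) for r in range(c + 1)])

# Hypothetical weighting matrix and state, chosen only for illustration.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
theta_Q = symmetric_to_parameters(Q)
x = np.array([1.0, -1.0])
stage_value = theta_Q @ quadratic_features(x)
```

With this convention `theta_Q @ quadratic_features(x)` reproduces $-\frac{1}{2} x^\top Q x$, which is why the off-diagonal parameters carry a factor of two.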


System Dynamics 


The linear system dynamics lead to the relations 


$$\nabla_{x^{(k)}} x^{(k+1)} = A_D^{(k)} \quad \text{and} \quad \nabla_{u_i^{(k)}} x^{(k+1)} = B_{D,i}^{(k)}. \tag{6.60}$$


Then, D; can be determined with (6.54) and (6.55). 
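
For time-invariant linear dynamics, the sensitivity of the stacked state trajectory with respect to one player's stacked controls can be assembled directly from powers of the system matrix. The sketch below builds the plain (unmodified) sensitivity, not the trapezoid-inspired variant of (6.55), and it uses a layout chosen for this example (rows indexed by the state time step, columns by the control time step), which may differ from the block arrangement used for $D_i$ in the text. The matrices and the horizon are hypothetical values.

```python
import numpy as np

def state_sensitivity(A, B, horizon):
    """Sensitivity of stacked states x^(0..h-1) w.r.t. stacked controls u^(0..h-1).

    Block (k2, k1) equals A^(k2-k1-1) @ B for k2 > k1 and zero otherwise.
    """
    n, m = B.shape
    D = np.zeros((n * horizon, m * horizon))
    for k1 in range(horizon):
        block = B.copy()                      # effect of u^(k1) on x^(k1+1)
        for k2 in range(k1 + 1, horizon):
            D[k2 * n:(k2 + 1) * n, k1 * m:(k1 + 1) * m] = block
            block = A @ block                 # propagate one step further
    return D

def simulate(A, B, x_init, controls):
    """Stack the states obtained from x^(k+1) = A x^(k) + B u^(k)."""
    states = [x_init]
    for u in controls[:-1]:
        states.append(A @ states[-1] + B @ u)
    return np.concatenate(states)

# Hypothetical sampled double integrator (illustrative values only).
A_d = np.array([[1.0, 0.02], [0.0, 1.0]])
B_d = np.array([[0.0002], [0.02]])
D = state_sensitivity(A_d, B_d, horizon=5)
```

Because the dynamics are linear, `D` maps a control trajectory exactly to the forced part of the state trajectory, which gives a simple consistency check against a simulation.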


The following example illustrates the solution of an inverse dynamic game with MaxEnt IRL 
to identify cost function parameters: 


Example 6.1: 


Consider a two-player discrete-time dynamic game with system dynamics (6.56) defined by 
the matrices 


(k) _ 
Ap = 


1 a | (k) _ Br 


a at De W @.05 | , je{l,2,keK (6.61)
and the initial value $x^{(1)} = [1 \;\; -1]^\top$. These matrices correspond to a continuous-time
double-integrator system (cf. Example 5.2) sampled with $\Delta T = 0.02$ s. In addition, let the
quadratic cost function of the players be given by (6.57), where


4 1 10 1 
== , Rı=-1, = - , Ro=-1. 6.62 
Q: f | 11 Q, | 1 | 22 (6.62) 
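Then, the features corresponding to the cost function of player $i$ are given by

$$\phi_{i,11}^{Q_i} = -\frac{1}{2} \big(x_1^{(k)}\big)^2, \quad \phi_{i,12}^{Q_i} = -\frac{1}{2} x_1^{(k)} x_2^{(k)}, \quad \phi_{i,22}^{Q_i} = -\frac{1}{2} \big(x_2^{(k)}\big)^2, \quad \phi_{i,11}^{R_{ii}} = -\frac{1}{2} \big(u_i^{(k)}\big)^2. \tag{6.63}$$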
The cost functions $J_i$ of player $i$ can be rewritten as
kg 
i i i Rii ; 
Ji == X a F CRON ee = Re = Oia Pi, » ie {1,2}. (6.64) 
k=1 
with the cost function parameters

$$\theta_1 = \theta_1^* = [4 \;\; 2 \;\; 3 \;\; 1]^\top, \qquad \theta_2 = \theta_2^* = [10 \;\; 2 \;\; 2 \;\; 1]^\top.$$

Now we assume $k_E = 250$ and use the coupled Riccati equations (3.60) to calculate the
OLNE⁴⁶ and obtain the trajectory set $\zeta^*$. The state and control trajectories belonging to this set
are corrupted by Gaussian white noise such that the resulting trajectories have a signal-to-noise
ratio (SNR) of 30 dB. A total number of 30 realizations are generated, leading to $n_t = 30$
trajectories $\tilde{\zeta}_l$, $l \in \{1, \dots, n_t\}$. These are used to evaluate the log-likelihood function (6.48),
for which we compute the necessary partial derivatives. The partial derivative $\nabla_x J_i$ is given
by (6.51), where


PORT) 
(Vx¢;) 6; = hie AR 2 cae (6.65) 
50,2% + 8;,(3)x 
27,02)*1 i,(3)*2 
Similarly, $\nabla_{u_i} J_i \in \mathbb{R}^{k_E}$ is determined by using the partial derivative
k 
(Vabi) 0i = Ou. (6.66) 
For the second partial derivatives we obtain
: 9:1) 59; (2) 0ra) 20; (2) 
Ver) = bikdia (-| ADE RED esol ae 2 (6.67) 
= PV Gio ao 291,02) 91,03 


Vu,u,Ji = blkdiag (Ar). (6.68)
The MLE (6.29) is performed using a numerical optimization method, namely the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. We obtain, after normalizing with respect to
$\theta_{i,(4)}$ for better comparability, the estimated parameters


$$\hat{\theta}_1 = [3.88 \;\; -2.22 \;\; 2.98 \;\; 1.00]^\top, \qquad \hat{\theta}_2 = [10.19 \;\; -1.69 \;\; 2.12 \;\; 1.00]^\top. \tag{6.69}$$
Consider now the feature count

$$\mu = \frac{1}{2} \sum_{k=1}^{k_E} \left[ \big(x_1^{(k)}\big)^2 \;\; x_1^{(k)} x_2^{(k)} \;\; \big(x_2^{(k)}\big)^2 \;\; \big(u_1^{(k)}\big)^2 \;\; \big(u_2^{(k)}\big)^2 \right]^\top. \tag{6.70}$$


The feature count of the trajectory $\hat{\zeta}$ generated by solving an LQ dynamic game with the
estimated parameters (6.69) is given by


$$\hat{\mu} = [9.88 \;\; -12.75 \;\; 17.04 \;\; 32.46 \;\; 12.66]^\top. \tag{6.71}$$


The mean feature count of observed trajectories is


$$\tilde{\mu} = [9.88 \;\; -12.76 \;\; 17.05 \;\; 32.90 \;\; 12.87]^\top, \tag{6.72}$$


suggesting, in consideration of (6.10), that the estimated parameters $\hat{\theta}_i$ differ from the
original parameters $\theta_i^*$, but lead to very similar costs. The original trajectory $\zeta^*$ and the
estimated $\hat{\zeta}$ are depicted in Figure 6.2, showing that the identified parameters are able to
explain the observed behavior.


[Figure 6.2: plot over $k \Delta T$ in s]

Figure 6.2: Observed trajectories and trajectories following from the estimated parameters of the LQ dynamic
game in Example 6.1

6.7.2 Feedback Case


Consider now an LQ dynamic game where players choose their feedback strategies (cf. Definition
A.5) based on a quadratic cost function. Since we consider a feedback (or MPS) information
pattern, the general quadratic cost functions are given by


kg N 
1 
h=-, = xT xh) 4 > Ru |, iep, (6.73) 
k=1 j=l 


and the resulting feedback strategies are given by (6.39). This relation can be used to obtain 
system dynamics from the point of view of player i given by 


N 
k+1) _ alk) (k (k) (k) (k) p(k) (k 
x(k) = A, x! + Bot; - X BK, x” 


j=l 
j#i 
RN k) (k) (k) er 
k 
AD) - X Bp, Ki" |x + BY) ul 
= 
jr 


=: AS) x® + BO u. 


As described in Section 6.5, inverse feedback dynamic games can be solved by exploiting
the knowledge of the strategies $\gamma_i^{(k)*}$. For the case of LQ dynamic games this means that the
feedback matrices $K_i^{(k)*}$, $i \in \mathcal{P}$, $k \in \mathcal{K}$, are given.
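
The reduction to dynamics from one player's point of view can be sketched numerically as follows. The sketch assumes time-invariant matrices and that the other players apply $u_j = -K_j x$, which is the sign convention suggested by the minus sign in front of the sum in (6.74); the function name and all numerical values are assumptions made for illustration.

```python
import numpy as np

def dynamics_from_player_view(A, B_list, K_list, i):
    """Absorb the known feedback laws u_j = -K_j x (j != i) into the system matrix."""
    A_i = A.copy()
    for j, (B_j, K_j) in enumerate(zip(B_list, K_list)):
        if j != i:
            A_i = A_i - B_j @ K_j
    return A_i, B_list[i]

# Hypothetical two-player system (illustrative values only).
A = np.array([[1.0, 0.02], [0.0, 1.0]])
B_list = [np.array([[0.0], [0.02]]), np.array([[0.0], [0.02]])]
K_list = [np.array([[1.2, 0.9]]), np.array([[0.7, 0.4]])]
A_1, B_1 = dynamics_from_player_view(A, B_list, K_list, i=0)
```

One step of the reduced system `A_1 @ x + B_1 @ u_1` then coincides with one step of the full system in which the second player applies its feedback law.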


Remark 6.7: 


In the typical case that $K_i^{(k)*}$, $i \in \mathcal{P}$, $k \in \mathcal{K}$, are not known, it is possible to assume an infinite
horizon, i.e. $k_E \to \infty$, and estimate a constant feedback law which approximates the relationship
between the controls and the states (cf. Section 5.4.2). In the case of an infinite-horizon inverse
LQ dynamic game, the estimation can be performed effectively by means of (5.36).


46 The continuous-time equations were used as the considered time step $\Delta T = 0.02$ s allows a quasi-continuous
analysis instead of the use of discrete-time equations for determining Nash equilibria. The interested reader is
referred to Section A.5 of the Appendix where references on discrete-time Riccati equations are given.


47 If the limit of the Riccati matrix $P_i^{(k)}$ for $(k = k_E \to \infty)$ exists, then it corresponds to a FNE for the infinite-horizon
dynamic game. In general, other FNE solutions may also exist which are not necessarily related to the
aforementioned solution [BO99, p. 290].

Features in LQ Feedback Dynamic Games


By using the known feedback control matrices $K_j^{(k)*}$, the quadratic cost function (6.73) of
player $i$ can be rewritten as


kg N 
1 
Ji= > xT OQ x + eK ORR N + a Ra” g (6.75) 
k=1 j=1 


jti 


The features corresponding to the entries of $Q_i$ and $R_{ii}$ are identical to the open-loop case
(cf. (6.58) and (6.59)). In the feedback case, we further have the features corresponding to the
entries of $R_{ij}$ which are given by
Rij 1 k)* k)» 
ie -5(&; EUR ee Be are (6.76) 
where (K x0), denotes the r-th entry of the vector K” xW, Similar to the matrices 
Q,, and R;;, the main diagonal elements of Rj; correspond to parameters which weight the 


R;; 


”,’ =1,...,mj;. For the case where c + r, these parameters correspond to two 


features ¢; 7> 


times the off-diagonal (symmetric) entries R;j,rc = Rij,cr, respectively. 


System Dynamics 


The linear system dynamics lead to the relations

$$
\nabla_x f^{(k)} = \tilde{A}_i^{(k)} \quad \text{and} \quad \nabla_{u_i} f^{(k)} = B_i^{(k)}. \tag{6.77}
$$

Then, $D_i$ can be computed with (6.54) and (6.55).


Example 6.2: 


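Consider a two-player discrete-time dynamic game with the system dynamics (6.61), the
initial value $x^{(1)} = [1 \;\; {-1}]^\top$ and cost functions of the form (6.73) with the cost function
matrices

$$
\begin{aligned}
Q_1 &= \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}, & R_{11} &= 1, & R_{12} &= 1, \\
Q_2 &= \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}, & R_{22} &= 1, & R_{21} &= 0.3.
\end{aligned}
\tag{6.78}
$$

The LQ dynamic game leads to feedback strategies

$$
u_i^{(k)*} = \gamma_i^{(k)*}(x) = \begin{bmatrix} K_{i(1)}^{(k)*} & K_{i(2)}^{(k)*} \end{bmatrix} \begin{bmatrix} x_1^{(k)} \\ x_2^{(k)} \end{bmatrix}. \tag{6.79}
$$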
We assume that $K_i^{(k)*} = [K_{i(1)}^{(k)*} \;\; K_{i(2)}^{(k)*}]$ is not known and is approximated by a constant
feedback law $\hat{K}_i$ to be identified, as mentioned in Remark 6.7. We obtain $\|\hat{K}_i x - u_i\| < 0.02$ for
all $i \in \{1, 2\}$. The approximation of the time-variant control matrices $K_i^{(k)*}$ by means of the
constant matrices $\hat{K}_i$ is shown in Figure 6.3.


[Figure 6.3: plot of the feedback matrix entries $K_{i(1)}$, $K_{i(2)}$ of both players over $k \Delta T$ in s]

Figure 6.3: Nash equilibrium feedback matrices $K_i^{(k)*}$ and their approximation by means of constant feedback
matrices $\hat{K}_i$ in Example 6.2


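The features corresponding to the cost function of player i are given by:

$$
\begin{aligned}
\phi_{i(1)}^{Q,(k)} &= -\frac{1}{2} \big( x_1^{(k)} \big)^2, & \phi_{i(2)}^{Q,(k)} &= -\frac{1}{2} \big( x_2^{(k)} \big)^2, \\
\phi_{i}^{R_{ii},(k)} &= -\frac{1}{2} \big( u_i^{(k)} \big)^2, & \phi_{i}^{R_{ij},(k)} &= -\frac{1}{2} \big( K_{j(1)} x_1^{(k)} + K_{j(2)} x_2^{(k)} \big)^2.
\end{aligned}
\tag{6.80}
$$

The cost functions $J_i$, $i \in \{1, 2\}$, can be rewritten as

$$
J_i = -\sum_{k=1}^{k_E} \Big[ \theta_{i(1)} \phi_{i(1)}^{Q,(k)} + \theta_{i(2)} \phi_{i(2)}^{Q,(k)} + \theta_{i(3)} \phi_{i}^{R_{ii},(k)} + \theta_{i(4)} \phi_{i}^{R_{ij},(k)} \Big], \quad i, j \in \{1, 2\}, \; j \neq i, \tag{6.81}
$$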
k=1 


where the cost function parameters are given by

$$
\theta_1 = [8 \;\; 2 \;\; 1 \;\; 1]^\top, \qquad \theta_2 = [1 \;\; 4 \;\; 1 \;\; 0.3]^\top.
$$
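The equivalence between the feature-based form of the cost and the quadratic form can be checked numerically. The sketch below assumes scalar controls, diagonal state weights as in this example, and features defined as minus one half of the squared quantities; the stage data and function names are hypothetical and serve only as a consistency check.

```python
def stage_features(x, u_i, kj_x):
    """Features of player i at one stage: -1/2 times the squared quantities."""
    return [-0.5 * x[0] ** 2, -0.5 * x[1] ** 2, -0.5 * u_i ** 2, -0.5 * kj_x ** 2]


def cost_from_features(theta, stages):
    """J_i = - sum_k theta^T phi^(k)."""
    return -sum(
        sum(t * f for t, f in zip(theta, stage_features(x, u_i, kj_x)))
        for x, u_i, kj_x in stages
    )


def cost_quadratic(q_diag, r_ii, r_ij, stages):
    """J_i = 1/2 sum_k (x^T Q x + R_ij (K_j x)^2 + R_ii u_i^2) for diagonal Q."""
    return 0.5 * sum(
        q_diag[0] * x[0] ** 2 + q_diag[1] * x[1] ** 2
        + r_ii * u_i ** 2 + r_ij * kj_x ** 2
        for x, u_i, kj_x in stages
    )


theta_1 = [8.0, 2.0, 1.0, 1.0]  # parameters of player 1 from the example
# Hypothetical stages: (state, own control u_1, other player's K_2 x)
stages = [((1.0, -1.0), -0.4, 0.3), ((0.8, -0.7), -0.3, 0.2)]
j_feat = cost_from_features(theta_1, stages)
j_quad = cost_quadratic((8.0, 2.0), 1.0, 1.0, stages)
```

Both evaluations agree, which reflects that the parameter vector simply collects the diagonal entries of the weighting matrices.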


The calculated FNE trajectory $\zeta^*$ is used to identify cost function parameters which explain
it. However, this time the exact FNE trajectory $\zeta^*$ and one single demonstration, i.e. ny = 1,
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are used. Using the MLE (6.42) which is determined again with the BFGS method, we obtain 
the cost function parameters 


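$$
\begin{aligned}
\hat{\theta}_1 &= [7.67 \;\; 0.148 \;\; 1.00 \;\; 2.26]^\top, \\
\hat{\theta}_2 &= [-1.44 \;\; 2.47 \;\; 1.00 \;\; 0.72]^\top.
\end{aligned}
\tag{6.82}
$$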
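Similar to the previous example, we consider the extended feature count

$$
\mu = \sum_{k=1}^{k_E} \Big[ \phi_{(1)}^{Q} \;\; \phi_{(2)}^{Q} \;\; \phi^{R_{11}} \;\; \phi^{R_{22}} \;\; \phi^{R_{12}} \;\; \phi^{R_{21}} \Big]^\top \tag{6.83}
$$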


for both the observed trajectory $\zeta^*$ and the trajectory $\hat{\zeta}$ corresponding to the parameters (6.82),
obtaining

$$
\mu^* = [10.44 \;\; 15.92 \;\; 1.34 \;\; 10.37 \;\; 10.41 \;\; 1.34]^\top \tag{6.84}
$$

and

$$
\hat{\mu} = [10.44 \;\; 15.93 \;\; 1.34 \;\; 10.37 \;\; 10.37 \;\; 1.34]^\top, \tag{6.85}
$$

indicating that the identified parameters indeed approximate the observed trajectory
adequately (cf. Example 6.1).


6.8 Method Limitations 


Some potential limitations of the presented methods shall be discussed before concluding this
chapter. The introduced IRL-based inverse dynamic game methods can cope with truncated
trajectories in $[0, k_t]$ with $k_t < k_E$ as long as these represent the complete trajectories
adequately (cf. Remark 6.6). Small values of $k_t$ compared to $k_E$ may deteriorate the results, i.e.
the results improve the closer $k_t$ is to $k_E$.


Noise-corrupted trajectories can also represent an issue since the approach indirectly at- 
tempts to equalize the feature count values of observed trajectories with the ones which 
would arise from the probability density function with identified parameters. On the other 
hand, equalizing feature count values may lead to a greater robustness in case the features, i.e. 
the basis functions, are not specified correctly. The effects of these issues on the identification 
results will be examined in Chapter 7. 


Finally, a further possible detriment can arise if the available trajectories do not constitute a 
Nash equilibrium. The method is based on the probability density function (5.1) which in- 
cludes the implicit assumption that each player’s decision was not directly affected by the 
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choice of the other players’ controls, a sufficient condition of which is given by the avail- 
ability of trajectories representing a Nash equilibrium. In addition, the method for feedback 
information structures leverages the availability of feedback control laws. If the control laws 
describe the functional relationship between states and controls, then the modified system 
dynamics still reflect the actions of the other players. Therefore, the IRL methods have the 
potential of being robust to at least mild deviations from the Nash equilibrium. Indeed, the 
basis of the presented results is Assumption 6.2, which does not demand that the observed 
trajectories are exactly equal to a deterministic result of the dynamic game with cost function 
parameters $\theta_i^*$. This allows for the estimation of cost function parameters $\theta_i$ from
trajectories which represent and resemble Nash equilibrium trajectories, but may deviate from this
optimality.


6.9 Conclusion 


In this chapter, IRL was considered as a means to solve inverse problems in dynamic games. 
The principle of maximum entropy was applied to the dynamic game scenario and the ob- 
tained results were used to derive probability density functions to model the origin of ob- 
served dynamic game trajectories. Based on these, a maximum-likelihood estimation of the 
cost function parameters was proposed for the case when players apply Nash equilibrium 
strategies. Both open-loop and feedback strategies were regarded. In addition, the unbiased- 
ness of this maximum-likelihood estimation was proved under typical IRL assumptions. The 
results of this chapter lay the theoretical foundation for the application of MaxEnt IRL for 
identifying cost function parameters of players in a dynamic game. Finally, solutions of in- 
verse linear-quadratic dynamic games were shown to illustrate the presented methods and 
their applicability. 


After this last chapter presenting theoretical results on inverse dynamic games and their 
solution, the following chapters present a comparison between different method classes in 
both simulations and a real application. 


7 Simulations 


In the previous chapters, inverse problems in dynamic game theory were introduced and 
two main classes of methods were proposed for their solution, namely the residual-based 
IOC method and an IRL-based approach. These classes of methods are different from a theo- 
retical and conceptual point of view given their contrasting origins in automatic control and 
computer science. This chapter aims at presenting the capabilities of both classes of meth- 
ods and comparing them by using different test scenarios in simulations. In this way, their 
strengths and weaknesses shall be examined. Moreover, the IOC and IRL methods are sys- 
tematically compared with a Direct Bilevel (DB) approach which is based on the solution of 
a forward dynamic game in each iteration (see Section 2.1.1). 


This chapter starts with a mathematical description of the DB approach used for comparison 
to the new inverse dynamic game methods. Afterwards, the considered scenarios are intro- 
duced before explaining the general evaluation procedure applied in this chapter, as well as 
the metrics used for comparison. Then, the simulation results are presented and discussed. 
These results include an evaluation of the methods’ robustness to measurement noise and 
errors in the basis function vectors. After briefly analyzing the computation times of the
methods, the chapter ends with conclusions based on the obtained insights.


7.1 Direct Bilevel Approach 


The Direct Bilevel (DB) approach considered in this chapter is a direct extension of the
method introduced in [MTL10] (see also Section 2.1.1), which was recently formulated in
[MFP17a]. It aims to determine cost function parameters $\theta = (\theta_1, \dots, \theta_N)$ such that the
corresponding Nash equilibrium trajectories approximate the observed state and control
trajectories. For this objective, the following optimization problem can be formulated:

$$
\min_{\theta} J_{DB} = \int_0^T \| x_\theta(t) - \tilde{x}(t) \|^2 + \sum_{i=1}^{N} \| u_{\theta,i}(t) - \tilde{u}_i(t) \|^2 \, dt, \tag{7.1}
$$


where $x_\theta(t)$ and $u_{\theta,i}(t)$ denote Nash equilibrium trajectories resulting from cost functions
with parameters $\theta$. The objective functional $J_{DB}$ provides a natural squared-error metric
between candidate state and control trajectories and the observed Nash equilibrium state
$\tilde{x}(t)$ and control trajectories $\tilde{u}_i(t)$. Note that if the observed trajectories correspond to a Nash
equilibrium with cost function parameters $\theta^* \in \Theta$, then the optimization problem is solved for
any $\theta$ which also belongs to the solution set $\Theta$ according to the equivalence of cost functions
(cf. Section B.2 in the Appendix) which imply identical Nash equilibrium trajectories. Some
details need to be considered for practical implementation of this approach. These are given 
in Section B.6 in the Appendix. 
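For sampled trajectories, the upper-level objective can be approximated by a simple sum. The following sketch assumes equidistant samples with step `dt` and a rectangle-rule approximation of the integral; it only evaluates the objective for given candidate trajectories and does not include the forward Nash equilibrium computation that would produce them. Function and variable names are illustrative assumptions.

```python
def db_objective(x_sim, u_sim, x_obs, u_obs, dt):
    """Rectangle-rule approximation of the squared-error objective.

    x_sim, x_obs: lists of state vectors, one per sample.
    u_sim, u_obs: one list of control vectors per player.
    """
    total = 0.0
    for k in range(len(x_obs)):
        total += sum((a - b) ** 2 for a, b in zip(x_sim[k], x_obs[k]))
        for i in range(len(u_obs)):
            total += sum((a - b) ** 2 for a, b in zip(u_sim[i][k], u_obs[i][k]))
    return dt * total


# Hypothetical data: two samples, two states, one player with a scalar control
x_obs = [[0.0, 0.0], [1.0, 1.0]]
x_sim = [[0.0, 0.0], [1.0, 2.0]]
u_obs = [[[0.0], [1.0]]]
u_sim = [[[0.5], [1.0]]]
j_db = db_objective(x_sim, u_sim, x_obs, u_obs, dt=0.02)
```

In a bilevel scheme, an outer optimizer would call such a function once per iteration, after recomputing the candidate trajectories from the current parameter guess.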


7.2 Simulation Scenarios 


In this chapter, two main simulation scenarios are considered: 


1. a non-linear open-loop dynamic game with two players controlling a ball-on-beam 
system 


2. a generic LQ feedback dynamic game with three players 


In the first scenario, the ball-on-beam is chosen as a dynamic system. It is a well-known 
benchmark system in control engineering since it poses a challenging stabilization prob- 
lem which is representative of the difficulties generated by growing nonlinearities [HSK92, 
BSLK97]. This scenario shall serve to show the solution of inverse dynamic games with open- 
loop strategies. 


The second scenario consists of an LQ dynamic game with feedback strategies. Considering
the class of LQ dynamic games allows for an analysis with the tools developed in Chapter 5. 
Furthermore, in order to increase the complexity of the LQ dynamic game, a generic dynamic 
game is considered where three players influence a system by means of two control variables 
each. This scenario is used for the examination of inverse feedback dynamic games. 


For each scenario, one IOC-based method, one IRL method and a DB approach shall be com- 
pared. The performance comparison is first done with assumed perfect observations of the 
Nash trajectories. Nevertheless, an evaluation of the robustness of all methods to noise in the 
observations is also presented. 


7.3 Evaluation Method 


In the following, the evaluation method is presented. After describing the general steps con- 
stituting the whole evaluation process, the metrics used for the comparison are introduced. 
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7.3.1 General Steps 


The evaluation procedure used in this chapter is summarized in Figure 7.1 and shall be
explained in the following. For the simulation environment, a cost function structure defined
by a linear combination of basis functions according to (4.2) is assumed. Therefore, it is first
necessary to define a basis function vector $\phi_i$ and a parameter vector $\theta_i$ for each player $i \in \mathcal{P}$.
These cost functions are used to calculate the Nash equilibrium trajectories of the states
$x^*(t)$ and the controls $u_i^*(t)$.⁴⁸ For the case where perfect observations are assumed, the
observations $\tilde{x}(t)$ and $\tilde{u}_i(t)$ correspond to the calculated Nash equilibrium trajectories $x^*(t)$ and
$u_i^*(t)$. Otherwise, Gaussian white noise $\varepsilon^x$ and $\varepsilon^{u_i}$ is added to the Nash equilibrium state
trajectories and control trajectories to form the observations, respectively. The generated
observations $\tilde{x}(t)$ and $\tilde{u}_i(t)$ simulate dynamic game data which is measured and results from
the interaction between the players. Based on these observations, in the inverse dynamic
game step, one of the inverse dynamic game methods is applied to obtain estimations of the
cost function parameters $\hat{\theta}_i$ for all players $i \in \mathcal{P}$. At this point, the analysis of the
identification results may be conducted based on the parameter deviation, i.e. the comparison
of the estimated cost function parameters with the ground truth. Nevertheless, particularly
for the robustness evaluation, it will be examined whether potentially inexact identification
of the cost function parameters has a considerable impact on the capability to approximate
the observations. For these cases, identified trajectories $\hat{x}(t)$ and $\hat{u}_i(t)$ are determined. This
is done by calculating the Nash equilibrium again, yet this time based on the estimated
parameters $\hat{\theta}_i$ of all players. By comparing the identified trajectories with the ground truth
trajectories, it is possible to evaluate if the estimated parameters can describe the observed
outcome of the dynamic game despite a potential deviation from the real parameters. We
determine the trajectory deviation by calculating the metrics $\delta^x$, $\delta^u$, and $\Delta^\theta$ which are
presented in the next section.


7.3.2 Evaluation Metrics 


As previously mentioned, the results of the inverse dynamic game methods are evaluated with 
respect to the quality of the cost function parameter identification. Furthermore, the
approximation of the observed trajectories by means of the trajectories of the identified model is
also assessed. For these two objectives, two different metrics are used which are introduced
in the following. 


48 All simulated Nash equilibrium trajectories $x^*(t)$ and $u_i^*(t)$ are calculated using a continuous-time formulation
of the dynamic game using the different theorems from Section 3.6, depending on the information structure and
strategy types. The IRL-based methods, which were developed considering a discrete-time formulation, shall
be given equivalent system dynamics corresponding to the selected time step $\Delta T$ as shown in Examples 6.1 and
6.2. Furthermore, one single trajectory set will be used, i.e. ny = 1 for the IRL methods.
[Figure 7.1: block diagram with the steps calculate equilibrium, inverse dynamic game, parameter deviation ($\Delta^\theta$) and trajectory deviation ($\delta^x$, $\delta^u$)]

Figure 7.1: Evaluation procedure for simulation results


Cost Function Parameters 


Since identification of the cost function parameters is only possible up to a scaling constant,
the comparison is done after a normalization process. The ground truth parameters $\theta_i^*$ and
the identified parameters $\hat{\theta}_i$ are normalized with respect to an arbitrary parameter. In this
case and without loss of generality, the last entry of the vector $\theta_i$ is chosen. This is done for
all players $i \in \mathcal{P}$. Therefore, for the ground truth normalized parameter vectors $\theta_{i,(norm)}^*$ and
the normalized estimated parameter vectors $\hat{\theta}_{i,(norm)}$ of player i, we have

$$
\{\theta_{i,(norm)}^*\}_p = \frac{\{\theta_i^*\}_p}{\{\theta_i^*\}_{M_i}} \quad \text{and} \quad \{\hat{\theta}_{i,(norm)}\}_p = \frac{\{\hat{\theta}_i\}_p}{\{\hat{\theta}_i\}_{M_i}}, \quad \forall p \in \{1, \dots, M_i\}, \tag{7.2}
$$

where $\{\theta_i\}_p$ denotes the p-th entry of the parameter vector $\theta_i$.⁴⁹ The parameter $\{\theta_i\}_{M_i}$ is
therefore the last entry of the vector $\theta_i$. By using the normalized parameters, the relative
parameter error is defined as


$$
\delta_p^{\theta} = \frac{\{\hat{\theta}_{i,(norm)}\}_p}{\{\theta_{i,(norm)}^*\}_p}, \quad \forall p \in \{1, \dots, M_i\}. \tag{7.3}
$$


The comparison of the parameters is done by means of the absolute value of the relative error
of the parameters

$$
\Delta_p^{\theta} = | 1 - \delta_p^{\theta} |, \quad \Delta_p^{\theta} \in [0, \infty). \tag{7.4}
$$

Therefore, the closer the absolute values of the relative error $\Delta_p^{\theta}$ are to zero, the stronger the
similarity is between identified and ground truth parameters. The mean and maximum value
of $\Delta_p^{\theta}$ will be considered. These are denoted with $\Delta_{mean}^{\theta}$ and $\Delta_{max}^{\theta}$, respectively.
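The normalization and error computation can be illustrated with the rounded ground truth and DB estimates of the ball-on-beam scenario reported in Section 7.4.2. The sketch below makes one assumption that is not stated explicitly in the text: the mean is taken over all entries except the reference entry used for normalization (whose error is zero by construction). With the rounded values, this choice reproduces the reported figures of roughly 291% and 42.9%.

```python
def normalize(theta):
    """Divide all entries by the last entry of the parameter vector."""
    return [t / theta[-1] for t in theta]


def abs_relative_errors(theta_hat, theta_gt):
    """Return |1 - estimate/ground truth| for the normalized vectors."""
    hat_n, gt_n = normalize(theta_hat), normalize(theta_gt)
    return [abs(1.0 - h / g) for h, g in zip(hat_n, gt_n)]


# Rounded ground truth and DB estimates (two players, five parameters each)
gt = [[20.0, 1.0, 1.0, 1.0, 2.0], [1.0, 1.0, 10.0, 1.0, 1.0]]
db = [[20.11, 0.89, 3.91, 0.85, 2.00], [1.14, 1.01, 10.13, 1.09, 1.00]]

errors = []
for th_hat, th_gt in zip(db, gt):
    # Assumption: the reference entry (last one) is excluded from the statistics
    errors.extend(abs_relative_errors(th_hat, th_gt)[:-1])

delta_max = max(errors)                 # approx. 2.91, i.e. about 291 %
delta_mean = sum(errors) / len(errors)  # approx. 0.4286, i.e. about 42.9 %
```

Small differences to the reported percentages are expected because the published parameter values are rounded to two decimals.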


Comparison of Trajectories 


Before introducing the considered metrics for comparing trajectories, it is important to note
that in the simulations, trajectories are available in the form of a series of $K_i$ data points
described by the set

$$
\mathcal{T}_i = \{ t_k \in [0, T] \mid 1 \leq k \leq K_i \wedge 0 \leq t_k \leq T \}. \tag{7.5}
$$


In the following, $K_i = K$ is set for all $i \in \mathcal{P}$ to ease the comparison between ground truth and
estimated trajectories. The estimated trajectories $\hat{x}(t)$ and $\hat{u}_i(t)$, $i \in \mathcal{P}$, are the ones which
arise from the solution of the dynamic game with the estimated cost function parameters $\hat{\theta}_i$.
The different state and control trajectories may differ in maximal amplitude, which hinders
a direct comparison between them. In order to be able to compare the error measures of all
trajectories, it is reasonable to normalize each of them with respect to their respective
maximum value. Therefore, we consider the normalized sum of absolute trajectory errors
(NSAE), which in case of the state error, is defined as

$$
\delta_j^{x} = \frac{1}{\max_k |\tilde{x}_j^{(k)}|} \sum_{k=1}^{K} \big| \tilde{x}_j^{(k)} - \hat{x}_j^{(k)} \big|, \quad j \in \{1, \dots, n\}, \tag{7.6}
$$


49 The notation $\{\theta_i\}_p$ is equivalent to the previously introduced $\theta_{i(p)}$. These are used interchangeably in favor
of better readability.
where $x_j^{(k)} = x_j(t_k)$ denotes the k-th data point of the state $x_j$. For systems with more than one
state, the sum of NSAEs of the state trajectories

$$
\delta^{x} = \sum_{j=1}^{n} \delta_j^{x} \tag{7.7}
$$


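is considered. Similarly, the NSAE of the controls of player i is defined as

$$
\delta^{u_i} = \sum_{j=1}^{m_i} \frac{1}{\max_k |\tilde{u}_{i,j}^{(k)}|} \sum_{k=1}^{K} \big| \tilde{u}_{i,j}^{(k)} - \hat{u}_{i,j}^{(k)} \big|. \tag{7.8}
$$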


The overall sum of NSAEs of the control trajectories is given by

$$
\delta^{u} = \sum_{i=1}^{N} \delta^{u_i}. \tag{7.9}
$$


In the following, the error measures (7.7) and (7.9) will be used for trajectory comparison. 
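A compact way to evaluate these error measures for sampled trajectories is sketched below. It assumes that each trajectory is stored as a plain list of samples and that the reference trajectory is not identically zero; the function names and the numbers are hypothetical.

```python
def nsae(reference, estimate):
    """Normalized sum of absolute errors of one scalar trajectory."""
    peak = max(abs(v) for v in reference)  # normalization by the maximum amplitude
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / peak


def total_nsae(references, estimates):
    """Sum of the NSAE values over all components (states or controls)."""
    return sum(nsae(r, e) for r, e in zip(references, estimates))


# Hypothetical example with two state components and three samples each
x_ref = [[1.0, 0.5, -2.0], [0.0, 0.4, 0.2]]
x_est = [[0.9, 0.5, -1.8], [0.1, 0.4, 0.2]]
delta_x = total_nsae(x_ref, x_est)  # 0.3/2.0 + 0.1/0.4
```

The same two functions can be applied to the control components of each player and summed over the players to obtain the overall control error.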


7.4 Inverse Open-Loop Dynamic Games 


In this section, different classes of inverse dynamic game methods for identifying cost func- 
tion parameters corresponding to an open-loop Nash equilibrium are evaluated and com- 
pared. The methods are 


• the residual-based inverse differential game method of Section 4.3,

• the method of Section 6.4 based on IRL,

• the direct bilevel approach presented in Section 7.1 for the open-loop case, detailed in
Section B.6.


These are abbreviated and referred to as IOC, IRL and DB methods, respectively. 


7.4.1 Preliminaries 


The considered system is a ball-on-beam system which was extended such that the system is
controlled by two players simultaneously instead of one. The task is to balance a ball in the
middle of the beam.
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Figure 7.2: Ball-on-beam system 


System Dynamics 


The ball-on-beam system is shown schematically in Fig. 7.2. Here, $\alpha_w$ denotes the angle of
the beam towards the horizontal. In addition, (sx, sy) and (sx, sy) denote the positions of the 
ball in the earth-fixed and beam-fixed coordinate systems, respectively, both centered at the 
beam’s center of rotation. Both players are allowed to interact with the system by applying 
a torque $u_i(t) = M_i(t)$, $i \in \{1, 2\}$, with respect to the beam’s rotational axis. Let the system
state be defined as

$$
x(t) = [s_x(t) \;\; \dot{s}_x(t) \;\; \alpha_w(t) \;\; \dot{\alpha}_w(t)]^\top. \tag{7.10}
$$


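Then, the system dynamics are described by the nonlinear differential equation (cf. [BVBB14])

$$
\dot{x} = \begin{bmatrix}
x_2 \\[2pt]
\dfrac{m_b r_b^2 \left( x_1 x_4^2 - g_e \sin(x_3) \right)}{m_b r_b^2 + \Theta_b} \\[8pt]
x_4 \\[2pt]
\dfrac{-2 m_b x_1 x_2 x_4 - m_b g_e x_1 \cos(x_3) + u_1 + u_2}{m_b x_1^2 + \Theta_w}
\end{bmatrix}, \tag{7.11}
$$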


where the time dependence of the states and controls was dropped for better readability.
The variable $g_e$ is the gravitational constant, $\Theta_w$ is the inertia of the beam and $r_b$, $m_b$ and $\Theta_b$
are the radius, mass and inertia of the ball, respectively. The parameter values are given in
Table 7.1.


Table 7.1: Parameters of the ball-on-beam system used for simulation

g_e           m_b        r_b      Θ_b               Θ_w
9.81 m/s²     0.02 kg    25 mm    5·10⁻⁶ kg m²      0.667 kg m²
The inertia of the beam was calculated assuming a uniformly distributed mass $m_w = 1.3$ kg, a
width $d_w = 0.01$ m and a length $l_w = 2$ m.
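The right-hand side of the ball-on-beam model can be written as a small function, for instance as input to a numerical integrator. The sketch below assumes the standard ball-on-beam form with state x = [position, velocity, beam angle, angular velocity], which is spelled out in the comments, and uses the parameter values of Table 7.1 in SI units; the function name is hypothetical.

```python
import math

# Parameter values from Table 7.1 (SI units)
G_E = 9.81       # gravitational constant in m/s^2
M_B = 0.02       # mass of the ball in kg
R_B = 0.025      # radius of the ball in m
THETA_B = 5e-6   # inertia of the ball in kg m^2
THETA_W = 0.667  # inertia of the beam in kg m^2


def ball_on_beam(x, u1, u2):
    """State derivative of the two-player ball-on-beam system.

    Assumed model:
      dx1 = x2
      dx2 = m r^2 (x1 x4^2 - g sin(x3)) / (m r^2 + Theta_b)
      dx3 = x4
      dx4 = (-2 m x1 x2 x4 - m g x1 cos(x3) + u1 + u2) / (m x1^2 + Theta_w)
    """
    x1, x2, x3, x4 = x
    dx2 = M_B * R_B ** 2 * (x1 * x4 ** 2 - G_E * math.sin(x3)) / (M_B * R_B ** 2 + THETA_B)
    dx4 = (-2.0 * M_B * x1 * x2 * x4 - M_B * G_E * x1 * math.cos(x3) + u1 + u2) / (
        M_B * x1 ** 2 + THETA_W
    )
    return [x2, dx2, x4, dx4]


# Ball at rest at 0.5 m on a horizontal beam without torques
dx_free = ball_on_beam([0.5, 0.0, 0.0, 0.0], 0.0, 0.0)
# The same state with a total torque that balances the weight of the ball
hold = M_B * G_E * 0.5
dx_held = ball_on_beam([0.5, 0.0, 0.0, 0.0], 0.6 * hold, 0.4 * hold)
```

Only the sum of the two torques enters the dynamics, so any split of the holding torque between the players keeps the beam at rest in this configuration.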


Cost Functions and Data Generation 


Each player acts based on an individual cost function of the form (4.2), where the basis
function vector is given by

$$
\phi_i(x, u_i) = [x_1^2 \;\; x_2^2 \;\; x_3^2 \;\; x_4^2 \;\; u_i^2]^\top. \tag{7.12}
$$


This feature vector describes both players’ individual preferences to zero the ball’s
displacement from the center of the beam, its velocity, the beam’s angle and angular velocity,
respectively. Furthermore, it represents the desire to keep their individual torques small. In the
following, units are neglected as all quantities are given in SI units. To model the players’ 
behavior by means of cost functions, let the ground truth parameters be given by 


$$
\theta_1^* = [20 \;\; 1 \;\; 1 \;\; 1 \;\; 2]^\top \quad \text{and} \quad \theta_2^* = [1 \;\; 1 \;\; 10 \;\; 1 \;\; 1]^\top. \tag{7.13}
$$


In this way, the first player focuses on bringing the ball to the center of the beam whereas 
the second player mainly focuses on bringing the beam to a horizontal position (see state 
definition in (7.10)). 


For the calculate equilibrium step, the system dynamics and cost functions with ground 
truth parameters are used to solve for open-loop Nash equilibrium trajectories by applying 
Pontryagin’s minimum principle and then solving the resulting two-point boundary value 
problem, where the initial state 


$$
x(0) = [0.5 \;\; 0 \;\; 0 \;\; 0]^\top, \tag{7.14}
$$

was used. The solution leads to trajectories $x^*(t)$ and $u_i^*(t)$ corresponding to the open-loop
Nash equilibrium (OLNE). Further details on the calculation are given in Section B.4 of the
Appendix. The equilibrium state is illustrated in Figure 7.3, where the trajectories of the ball
position and beam angle, i.e. of the states $x_1(t)$ and $x_3(t)$, are depicted. The applied torques of
each player, i.e. the controls $u_1(t)$ and $u_2(t)$, are also shown. The different preferences of the
players modeled by the cost function parameters in (7.13) can be recognized. Player 1 applies
a positive torque such that the ball is moved towards the zero position, whereas player 2
counteracts this action since this player's focus is to regulate the beam angle towards zero.


7.4.2 Noisefree Case 


Figure 7.3: Open-loop Nash equilibrium trajectories of the ball-on-beam system

The inverse methods are first tested under the assumption that the observed trajectories
correspond exactly to the OLNE trajectories generated by the ground truth cost function
parameters $\theta_i^*$. This represents an ideal condition to analyze the extent to which the real
parameters $\theta_i^*$ can be obtained. The cost function parameter values are given with a precision of
two decimal places. More precision is not needed since, as will be shown later, differences of
a smaller order of magnitude barely have an effect on the corresponding trajectories. Nevertheless,
the parameter errors $\Delta^\theta$ are calculated with the highest possible precision.


Inverse Optimal Control Based Method 


The trajectories of the open-loop Nash equilibrium are used to determine the parameters $\theta_i$
of each player by means of Algorithm 1. The solution of the RDE appearing in the method
was calculated by means of a numerical solver of MATLAB (ode45).

The estimated parameters are⁵⁰

$$
\begin{aligned}
\hat{\theta}_1 &= [19.99 \;\; 1.00 \;\; 1.00 \;\; 1.00 \;\; 2.00], \\
\hat{\theta}_2 &= [1.01 \;\; 1.00 \;\; 10.00 \;\; 1.00 \;\; 1.00],
\end{aligned}
\tag{7.15}
$$

which lead to a mean parameter error $\Delta_{mean}^{\theta} = 0.16\%$ and a maximum parameter error
$\Delta_{max}^{\theta} = 0.76\%$. The NSAE of the states is $\delta^x = 0.0271$. The NSAE of the controls is
$\delta^u = 0.025$.


50 For the presented inverse open-loop dynamic game results, the parameter vectors $\hat{\theta}_i$, $\forall i \in \mathcal{P}$, were multiplied
with a constant factor $c \in \mathbb{R}^+$ such that the last entry corresponds to the ground truth, i.e. $\hat{\theta}_{i(5)} = \theta_{i(5)}^*$,
$\forall i \in \mathcal{P}$. This was done in favor of greater clarity in the comparison.
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Inverse Reinforcement Learning Based Method 


In order to solve the inverse dynamic game problem, Algorithm 3 was applied. The
optimization problem corresponding to the MLE (6.29) was solved with the MATLAB solver
fminunc, using a BFGS Quasi-Newton method. The estimated parameters are

$$
\begin{aligned}
\hat{\theta}_1 &= [19.51 \;\; 0.95 \;\; 0.73 \;\; 0.77 \;\; 2.00], \\
\hat{\theta}_2 &= [1.04 \;\; 1.01 \;\; 9.99 \;\; 1.02 \;\; 1.00].
\end{aligned}
\tag{7.16}
$$

We obtain a mean parameter error $\Delta_{mean}^{\theta} = 8.1\%$ and a maximum parameter error
$\Delta_{max}^{\theta} = 27.0\%$. The NSAE of the states is $\delta^x = 0.664$. The NSAE of the controls is $\delta^u = 0.554$. The
parameter error is larger than the one generated by the IOC approach. The NSAE values are
also higher than the ones corresponding to the IOC-based identification.


Direct Bilevel Approach 


For this method, the optimization problem (7.1) was solved using the procedure in Section
B.6 with an interior-point method of MATLAB’s fmincon solver.

The estimated parameters are

$$
\begin{aligned}
\hat{\theta}_1 &= [20.11 \;\; 0.89 \;\; 3.91 \;\; 0.85 \;\; 2.00], \\
\hat{\theta}_2 &= [1.14 \;\; 1.01 \;\; 10.13 \;\; 1.09 \;\; 1.00].
\end{aligned}
\tag{7.17}
$$

We obtain a mean parameter error $\Delta_{mean}^{\theta} = 42.9\%$ and a maximum parameter error $\Delta_{max}^{\theta} = 290.9\%$.
The NSAE of the states is $\delta^x = 1.4322$. The NSAE of the controls is $\delta^u = 0.122$. The parameter
error is larger than the one generated by both the IOC and IRL approaches.


Comparison 


Table 7.2 summarizes the results of the parameter identification with all methods.
In addition, the identified parameters $\hat{\theta}_i$ of all methods are used to generate OLNE
trajectories $\hat{x}(t)$ and $\hat{u}_i(t)$. Both the original and identified trajectories of the controls as well
as the ball position and beam angle (states $x_1$ and $x_3$, respectively) are depicted in Figure 7.4.
While the parameter errors of the identification with IRL and the DB approach are higher
than the ones corresponding to the IOC method, they do not have a big impact on the
trajectory approximation in this setup. The OLNE of all identified cost functions is practically
identical to the original OLNE trajectories. The differences are imperceptible even though
there is a slight difference in the estimation accuracy by all methods. This also confirms
that the presented parameter precision of two decimal places is sufficient for an adequate
comparison.
Table 7.2: Ground truth and cost function parameters of the nonlinear OL differential game identified with all
methods using noiseless trajectories

Method   θ_1                                θ_2
GT       [20.00 1.00 1.00 1.00 2.00]        [1.00 1.00 10.00 1.00 1.00]
IOC      [19.99 0.99 1.00 0.99 2.00]        [1.01 1.00  9.99 1.00 1.00]
IRL      [19.51 0.95 0.73 0.77 2.00]        [1.04 1.01  9.99 1.02 1.00]
DB       [20.11 0.89 3.91 0.85 2.00]        [1.14 1.01 10.13 1.09 1.00]
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Figure 7.4: Trajectories resulting from the nonlinear inverse dynamic game solutions with IOC, IRL and DB methods 


7.4.3 Robustness to Measurement Noise 


In practice, measurements of the states and controls corresponding to a dynamic game may 
not be ideal. For example, the measurements may be affected by noise, which can be detri- 
mental for the identification of cost function parameters. Therefore, the results of the inverse 
dynamic game methods should ideally be robust to measurement noise. In order to evalu- 
ate this property for the considered open-loop methods, Gaussian white noise is artificially 
added to the state and control trajectories. Hence, the new measurements which are used for 
identification of cost function parameters are given by 


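$$
\tilde{x}_z(t) = x_z^*(t) + \varepsilon_z^{x}, \quad \forall z \in \{1, \dots, n\}, \tag{7.18}
$$
$$
\tilde{u}_{i,z}(t) = u_{i,z}^*(t) + \varepsilon_z^{u_i}, \quad \forall z \in \{1, \dots, m_i\}, \; \forall i \in \mathcal{P}. \tag{7.19}
$$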
The noise $\varepsilon_z^{x}$ and $\varepsilon_z^{u_i}$ was chosen in such a way that all signals have a particular
signal-to-noise ratio (SNR). Different SNR levels from 20 dB to 40 dB were considered for trajectory
generation. In order to examine the consistency of the results, 100 samples of Gaussian white
noise are generated for each of the considered SNR levels such that we obtain the trajectories
$\tilde{\zeta}_s$, $s \in \{1, \dots, 100\}$ (cf. Definition 6.2). Figure 7.5 shows examples of noise-corrupted Nash
equilibrium trajectories with different SNR values. The generated noisy trajectories are used
to identify cost function parameters with all methods. Therefore, for each of the methods, we
obtain 100 sets of identified parameters $\hat{\theta}_s$, $s \in \{1, \dots, 100\}$. In turn, each of these is used to
compute corresponding OLNE trajectories denoted by $\hat{\zeta}_s$, $s \in \{1, \dots, 100\}$. The mean over all
100 values of the identified parameters of each player, denoted by $\hat{\theta}_{i,mean}$, is computed for the
following analysis. Moreover, the comparison of the estimated parameters and trajectories
with the original ones is assessed with the mean of the NSAE (defined in (7.7), (7.8) and (7.9))
over all 100 trajectories. These are denoted by $\delta_{mean}^{x}$, $\delta_{mean}^{u_i}$ and $\delta_{mean}^{u}$, respectively. Similarly,
the maximum and mean parameter errors over all 100 results, denoted by $\Delta_{max}^{\theta}$ and $\Delta_{mean}^{\theta}$,
are considered (cf. (7.4)).
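The generation of noise with a prescribed SNR can be sketched as follows. The text does not state which SNR definition was used; the sketch assumes the common power-ratio definition, SNR = 10 log10(P_signal / P_noise), with the signal power taken as the mean square of the sampled trajectory. The test signal, the seed and the function names are hypothetical.

```python
import math
import random


def add_noise_with_snr(signal, snr_db, rng):
    """Add zero-mean Gaussian white noise such that the expected SNR is snr_db."""
    power = sum(v * v for v in signal) / len(signal)   # mean-square signal power
    sigma = math.sqrt(power / 10.0 ** (snr_db / 10.0)) # required noise standard deviation
    return [v + rng.gauss(0.0, sigma) for v in signal]


def empirical_snr_db(clean, noisy):
    """Measure the SNR actually realized by one noise sample."""
    p_signal = sum(v * v for v in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(p_signal / p_noise)


rng = random.Random(0)  # fixed seed for reproducibility
clean = [math.sin(0.01 * k) for k in range(20000)]  # hypothetical trajectory
noisy = add_noise_with_snr(clean, 30.0, rng)
snr_check = empirical_snr_db(clean, noisy)  # close to 30 dB for long signals
```

Calling `add_noise_with_snr` repeatedly with the same generator yields independent noise realizations, which is how a set of 100 noisy trajectories per SNR level could be produced.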


Inverse Optimal Control 


The mean values of the identified cost function parameters are given in Table 7.3, where
the noise-free case is listed for comparison and is denoted by an infinite SNR. The parameter
error increases considerably with the presence of noise. Even with an SNR value of 30 dB,
which implies a rather low magnitude of the noise, the parameters deviate significantly from
the ground truth. In particular, from this SNR value on, the parameter $\hat{\theta}_{i(3)}$ becomes negative,
which implies a reward of the deviations from zero, instead of a penalty as originally stated.
This trend is confirmed by the mean values of the parameter and trajectory errors which are
summarized in Table 7.4. The table shows very high errors for an SNR value equal to 30 dB
or below.


Table 7.3: Mean values of the cost function parameters of the inverse nonlinear OL dynamic game which were
identified with the IOC method

SNR in dB   θ_1,mean                              θ_2,mean
20          32.83  3.71  -29.72  10.29  2.00      52.57  12.11  -115.97  38.37  1.00
25          24.19  1.90    9.47   4.08  2.00      16.51   4.42   -30.00  12.44  1.00
30          21.33  1.31   -2.78   2.01  2.00       6.02   2.12    -3.12   4.71  1.00
35          20.40  1.10   -0.13   1.30  2.00       2.58   1.35     5.87   2.16  1.00
40          20.14  1.03    0.64   1.10  2.00       1.50   1.11     8.77   1.36  1.00
∞           19.99  0.99    1.00   0.99  2.00       1.01   1.00     9.99   1.00  1.00
Figure 7.5: Noise-corrupted open-loop Nash equilibrium trajectories of the two-player dynamic game with the
nonlinear ball-on-beam system
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Table 7.4: Mean parameter errors and NSAE of trajectories obtained with the IOC method 


SNR in dB   δ^x_mean   δ^u1_mean   δ^u2_mean   Δ^θ_max   Δ^θ_mean
20          69.522     129.948     221.314     9206.2    49.28
25          20.706      69.119     100.339       47.59    6.56
30          10.449      38.274      55.584       10.05    2.10
35           4.123      15.534      22.563        4.45    0.68
40           1.469       5.177       7.519        2.57    0.25
∞            0.027       0.010       0.0147       0.01    0.002


Inverse Reinforcement Learning 


The mean values of the identified cost function parameters are given in Table 7.5. The order
of magnitude of the parameters is similar for all SNR values, but the results are also negatively
affected by lower SNR values. For an SNR value of 20 dB, the parameter $\hat{\theta}_{1(3)}$ of player 1
becomes negative, leading to a reward of the deviation of $x_3$ from zero. The mean values
of the errors listed in Table 7.6 are moderately low compared to the IOC results, especially
the mean parameter error and the mean NSAE of the states.


Table 7.5: Mean values of the cost function parameters identified with the IRL method

SNR in dB   θ_1,mean                            θ_2,mean
20          20.62  1.19  -2.58  1.79  2.00      1.53  1.16  7.03  1.58  1.00
25          19.85  0.97   0.79  0.99  2.00      1.23  1.06  9.13  1.20  1.00
30          19.60  0.94   1.16  0.81  2.00      1.09  1.02  9.63  1.08  1.00
35          19.53  0.93   1.17  0.76  2.00      1.05  1.01  9.88  1.04  1.00
40          19.51  0.92   1.41  1.72  2.00      1.05  1.01  9.98  1.03  1.00
∞           19.51  0.95   0.73  0.77  2.00      1.04  1.01  9.99  1.02  1.00


Table 7.6: Parameter errors and NSAE obtained with the IRL method

SNR in dB   δ^x_mean   δ^u1_mean   δ^u2_mean   Δ^θ_max   Δ^θ_mean
20          4.808      10.915      15.854      23.05     1.07
25          2.522       4.256       6.182      10.24     0.39
30          1.345       2.088       3.033       6.83     0.20
35          0.920       1.109       1.610       5.42     0.14
40          0.724       0.591       0.855       2.97     0.15
∞           0.664       0.227       0.327       0.27     0.08
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Direct Bilevel Approach 


The mean values of the identified cost function parameters are given in Table 7.7. The
identified parameters are very similar for all SNR values and no clear SNR-dependent trend can be
recognized. Almost all parameters are very similar to the ground truth. Only the parameter
$\hat{\theta}_{1(3)}$ of the first player could not be recovered exactly. The mean values of the errors listed in
Table 7.8 show that the parameter and trajectory errors overall do increase with smaller SNR
values. However, even for the lowest SNR value of 20 dB, the errors, especially the NSAE of
the controls, are low.


Table 7.7: Mean values of the identified cost function parameters obtained from noisy trajectories using the DB method


SNR in dB   $\hat{\theta}_{1,\mathrm{mean}}$          $\hat{\theta}_{2,\mathrm{mean}}$
20          20.12   0.86   3.16   0.91   2.00       1.06   1.00    9.97   1.04   1.00
25          20.13   0.90   3.03   0.93   2.00       1.09   1.00   10.06   1.06   1.00
30          20.05   0.94   2.34   0.94   2.00       1.05   1.00   10.03   1.04   1.00
35          20.02   0.92   2.19   0.92   2.00       1.01   1.00    9.96   1.01   1.00
40          20.03   0.96   1.99   0.95   2.00       1.04   1.00   10.04   1.03   1.00
∞           20.11   0.89   3.91   0.85   2.00       1.14   1.01   10.13   1.09   1.00


Table 7.8: Mean parameter errors and NSAE obtained from noisy trajectories using the DB method 


SNR in dB   $\delta^{x}_{\mathrm{mean}}$   $\delta^{u_1}_{\mathrm{mean}}$   $\delta^{u_2}_{\mathrm{mean}}$   $\Delta^{\theta}_{\max}$   $\Delta^{\theta}_{\mathrm{mean}}$
20          4.424                          0.239                            0.355                            12.203                     0.547
25          2.887                          0.144                            0.214                            12.871                     0.451
30          1.872                          0.087                            0.128                            8.743                      0.312
35          1.329                          0.056                            0.081                            5.420                      0.259
40          0.881                          0.037                            0.053                            7.034                      0.192
∞           1.432                          0.050                            0.072                            2.909                      0.429


Comparison 


The results of cost function identification with noisy measurements are now compared. The 
mean values of the parameter error corresponding to the SNR values of 20 dB to 40 dB are 
illustrated in Figure 7.6. In a similar way, Figure 7.7 contrasts the mean values of the NSAE 
of the states and controls. 


Figure 7.6 shows that the IOC approach outperforms the IRL and DB methods in the case of perfect observations of the Nash equilibrium trajectories, but its parameter estimation becomes markedly worse as the SNR decreases. In contrast, the IRL and DB methods yield similar results across all SNR values and are less affected by measurement noise. The DB method is slightly better than the IRL approach only for the 20 dB case. A similar trend is observed in Figure 7.7. Nevertheless, it is noticeable that the differences in the parameter estimation can lead to large differences in the mean NSAE. The superiority of the IRL and DB methods in terms of robustness to measurement noise is confirmed. However, it can be observed that the DB approach yields the lowest NSAE of the controls.


Figure 7.7: Comparison of trajectory errors of identification for all SNR values and all methods


In order to obtain a better insight into the quality of the trajectory approximation, the mean values of the parameters identified with each method, i.e. the parameters in Tables 7.3, 7.5 and 7.7, are used to generate corresponding model state and control trajectories. Figure 7.8 shows an example for an SNR value of 30 dB. The IRL and DB methods yield very similar results. The IOC approach is able to explain the state trajectories adequately, but fails to reproduce the course of the control trajectories. For SNR values lower than 30 dB, the control trajectory approximation by the IRL method starts to deteriorate while the DB approach maintains its robustness. Plots of this comparison for all SNR values can be found in Section E.1.1 of the Appendix.


Figure 7.8: Observed trajectories and estimations based on mean identification results of all methods, SNR = 30 dB


7.4.4 Robustness to a Basis Function Mismatch 


Especially in practical applications, it cannot be assured that the observed trajectories constitute Nash equilibrium trajectories generated by the considered basis functions $g_i$. In order to give a first evaluation of the limits of the presented methods, a mismatch between the original ground truth (GT) basis functions and the ones used in the inverse dynamic game methods is considered in this section. The following analysis uses the noisefree trajectories generated by the parameters $\theta_1^*$ and $\theta_2^*$, as given in Section 7.4.1, for identification. However, for both the inverse dynamic game step and the subsequent forward solution to obtain the estimated trajectories $x(t)$, $u_1(t)$ and $u_2(t)$ (cf. Figure 7.1), four different basis function vectors are considered which differ from the original ones. These are given in Table 7.9.


The choice is motivated by the control task and the ground truth parametrization (cf. (7.13)). The basis functions $x_2^2$ and $x_4^2$, corresponding to the ball velocity and the beam angular velocity, are both weakly weighted by $\theta_{i,(2)}$ and $\theta_{i,(4)}$, respectively. Therefore, case I neglects these basis functions to evaluate their significance for identification. Cases II and III disregard one additional basis function, either $x_1^2$ or $x_3^2$, corresponding to the ball position and the beam angle, respectively. Finally, case IV represents a situation where one of the basis functions is incorrectly specified.


Table 7.9: Considered cases in the basis function mismatch analysis of inverse open-loop dynamic games


Case   $g_i$
GT     $[x_1^2 \;\; x_2^2 \;\; x_3^2 \;\; x_4^2 \;\; u_i^2]$
I      $[x_1^2 \;\; x_3^2 \;\; u_i^2]$
II     $[x_3^2 \;\; u_i^2]$
III    $[x_1^2 \;\; u_i^2]$
IV     $[x_1^2 \;\; x_2 \;\; x_3^2 \;\; x_4^2 \;\; u_i^2]$
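As a minimal sketch of what the cases of Table 7.9 mean in practice, the following Python snippet evaluates the basis function vectors $g_i$ and a running cost of the form $\theta_i^\top g_i$. The linear-in-parameters cost structure, the dictionary `BASIS`, the helper `running_cost` and the numerical values of the state and control are assumptions made for illustration only; the assignment of cases II and III follows the verbal description above.

```python
# Basis function vectors g_i(x, u_i) for the cases of Table 7.9.
# x = (x1, x2, x3, x4) is the state, u the control of player i.
BASIS = {
    "GT":  lambda x, u: [x[0] ** 2, x[1] ** 2, x[2] ** 2, x[3] ** 2, u ** 2],
    "I":   lambda x, u: [x[0] ** 2, x[2] ** 2, u ** 2],
    "II":  lambda x, u: [x[2] ** 2, u ** 2],
    "III": lambda x, u: [x[0] ** 2, u ** 2],
    "IV":  lambda x, u: [x[0] ** 2, x[1], x[2] ** 2, x[3] ** 2, u ** 2],
}


def running_cost(case, theta, x, u):
    """Running cost theta^T g_i(x, u_i) (assumed linear-in-parameters form)."""
    g = BASIS[case](x, u)
    if len(theta) != len(g):
        raise ValueError("parameter vector does not match the basis of this case")
    return sum(t * gi for t, gi in zip(theta, g))


# Ground truth parameters of player 1 and an arbitrary (made-up) operating point.
theta_1 = [20.0, 1.0, 1.0, 1.0, 2.0]
x, u = (0.1, -0.2, 0.05, 0.0), 0.3

cost_gt = running_cost("GT", theta_1, x, u)   # all state terms are penalized
cost_iv = running_cost("IV", theta_1, x, u)   # linear x2 term can lower the cost
```

With the same parameter vector, the linear term of case IV contributes a negative amount whenever $x_2 < 0$, which hints at why such a basis function is a poor fit for a task that regulates all states to zero.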


The basis functions are assumed to differ from the ground truth and hence, the parameters are not comparable. Therefore, only the NSAE of the trajectories is considered for the evaluation. The NSAE arising from identification with each method is given in Table 7.10 for each case. For case I we observe a low NSAE of the states and a higher NSAE of the controls. Cases II and III lead to worse results in terms of the state trajectory approximation. Lastly, for case IV only the IRL method yields low NSAE values for the states, whereas the DB method can only approximate the control trajectories adequately. The observed and the estimated trajectories are shown by way of example for cases I and IV in Figures 7.9 and 7.10. Additional plots describing the results of the other cases can be found in Section E.1.2 of the Appendix.


Table 7.10: NSAE in case of basis function mismatch


Case   Method   $\delta^{x}$   $\delta^{u_1}$   $\delta^{u_2}$   $\delta^{u}$
I      IOC      17.395         35.338           50.936           86.274
       IRL      14.993         41.280           59.513           100.793
       DB       129.579        10.909           15.040           25.949
II     IOC      339.288        17.472           23.152           40.624
       IRL      315.659        13.063           21.286           34.348
       DB       338.988        12.635           20.357           32.992
III    IOC      117.505        15.356           22.504           37.859
       IRL      119.463        14.479           19.814           34.292
       DB       128.401        15.429           20.357           35.786
IV     IOC      521.423        11036.171        15928.605        26964.776
       IRL      15.394         47.481           68.439           115.920
       DB       124.605        4.994            4.595            9.589


7.4.5 Discussion of Inverse Open-Loop Dynamic Game Results


By comparing the results of all methods based on noisefree trajectories, it can be recognized that the method based on IOC offers the best results in terms of parameter accuracy. This also leads to a better performance regarding the approximation of the ground truth trajectories. Nevertheless, even though the IRL method and the DB approach exhibit a lower parameter approximation accuracy, both are still able to explain the Nash equilibrium trajectories. While there is a minor numerical difference between their trajectory approximation errors, it is so small that it is imperceptible, as shown by Figure 7.4.


Figure 7.9: Inverse open-loop dynamic game results for all methods, basis function mismatch case I.


The differences in the parameter identification results can be explained by the different characteristics of each of the methods. All methods are based on the solution of an optimization problem. In the case of the IOC approach, the parameters which exactly fulfill the conditions for Nash equilibria are sought. Since the observations are perfect, i.e. they correspond to an exact Nash equilibrium, the corresponding cost function parameters can be found with great precision. The IRL approach is based on the maximization of a likelihood function which indirectly considers the requirement of matching the cost function values of the Nash equilibrium trajectories. The slight deviations from the true parameters arise because a sufficient match of trajectories, which corresponds to a peak in the likelihood function, may not require a precise estimation of the parameters. Finally, the DB approach similarly searches for parameters such that the deviation between the costs of observed and estimated trajectories is minimal. This also potentially does not require an exact estimation of the parameters.


Figure 7.10: Inverse open-loop dynamic game results for all methods, basis function mismatch case IV.


Having discussed these differences in the noisefree case, it is possible to find similar explanations for the results of identification in the presence of measurement noise in the observed Nash equilibrium trajectories. In this case, we observe that the IRL method and the DB approach are more robust towards measurement noise. Even down to SNR values of 25 dB and 20 dB, cost function parameters can be found which explain the observed trajectories. This can also be explained by the different principles each method is based on. The probabilistic formulation of the inverse dynamic game problem in the IRL-based method with the indirect requirement of matching trajectory costs leads to a higher robustness to noise. In contrast, the IOC approach is strongly affected by measurement noise. The parameter deviations of the IOC approach lead in particular to a poor approximation of the control trajectories. The approximation of the state trajectories is not strongly affected by the parameter deviations due to higher trajectory noise.⁵¹


Finally, the analysis of basis function mismatch indicates that all methods are moderately robust towards a small mismatch of the basis function vectors, especially regarding the state trajectory approximation. All methods yield greater errors if an originally relevant basis function (e.g. $x_1^2$ and $x_3^2$ in the example) is neglected. The results suggest that the task can, to some extent, still be described by the other basis functions with a correspondingly adequate parameterization which compensates for the missing basis functions. However, this possibility may depend on the real parametrization of the basis functions. This means that a missing basis function which was weighted by a high value of the corresponding parameter cannot be compensated for by other basis functions. In addition, a misspecified basis function as in case IV can affect the results of all methods considerably, especially those of the IOC method. This is due to the fact that the basis function $x_2$ is not appropriate for the task at hand, which consists of regulating all states to zero. The other methods, IRL and DB, are less affected since they, either directly or indirectly, take the deviation between trajectories into consideration. This is further illustrated by Table 7.11, where the parameters identified by each method in case IV are listed. The table indicates that the IRL and DB methods correctly estimate the parameter $\theta_{i,(2)}$, the one which corresponds to $x_2$, as a value which has to be at least close to zero such that trajectories similar to the observed ones can be obtained.


⁵¹ Similar results were reported in [MTFP16], where a one-player inverse optimal control problem was similarly solved by leveraging the minimum principle and where only the states were corrupted with noise in the evaluation.


Table 7.11: Identified cost function parameters for basis function mismatch case IV


Method   $\theta_1$                                    $\theta_2$
GT       20.00    1.00     1.00    1.00   2.00       1.00    1.00   10.00    1.00   1.00
IOC      18.22   23.90    19.40   -1.30   2.00      -5.56   -7.37   16.73   -3.07   1.00
IRL      18.78    0.00    16.10   -0.62   2.00       1.55    0.02   33.17   -0.20   1.00
DB       29.42    0.17   199.12   43.19   2.00       8.11    0.01   95.59   18.82   1.00


7.5 Inverse Feedback Dynamic Games 


After comparing inverse dynamic game methods for identification in open-loop dynamic games, this section is devoted to an evaluation of inverse feedback dynamic games in a Nash equilibrium, i.e. the players applied linear feedback strategies which led to an FNE. Analogously to the previous section, one method of each class is analyzed and compared in the following. In particular, the following methods are considered:


• the inverse LQ differential game method of Section 5.4,
• the method of Section 6.5 based on IRL,
• the direct bilevel approach presented in Section 7.1 for the feedback case, detailed in Section B.6.


Again, these are abbreviated and referred to as the IOC, IRL and DB methods, respectively.


7.5.1 Preliminaries 


The following analysis is conducted by means of an infinite-horizon linear-quadratic dynamic game with the following system dynamics and cost functions.


System Dynamics


The system is described by the differential equation

\[
\dot{x}(t) = A x(t) + \sum_{i=1}^{3} B_i u_i(t) \tag{7.20}
\]

with

\[
A = \begin{bmatrix} -8 & -6 & 1 & 0 \\ 1 & 0 & 2 & 1 \\ 0 & -2 & 0 & 1 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \quad
B_1 = \begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix}, \quad
B_2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
B_3 = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}.
\]

Therefore, in this case, each player $i$ has a control vector $u_i \in \mathbb{R}^{m_i}$ with $m_i = 2$ to apply at each time $t$. The system $(A, [B_1 \; \cdots \; B_N])$ is stabilizable and therefore, the existence of stabilizing linear feedback strategies of the form

\[
u_i(t) = -K_i x(t), \quad \forall i \in \mathcal{P}
\]

is guaranteed [EBS00].
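The stabilizability statement can be checked numerically. The following sketch, assuming NumPy is available and that the entries of $A$ and $B_i$ are read correctly from (7.20), tests the stronger property of controllability via the Kalman rank condition; controllability is sufficient, but not necessary, for stabilizability. The variable names are chosen for illustration.

```python
import numpy as np

# System matrices as read from (7.20); treat the entries as an assumption.
A = np.array([[-8.0, -6.0, 1.0, 0.0],
              [1.0, 0.0, 2.0, 1.0],
              [0.0, -2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, -1.0]])
B1 = np.array([[0.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
B2 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

# Stack the input matrices of all players: B = [B1 B2 B3].
B = np.hstack([B1, B2, B3])
n = A.shape[0]

# Kalman controllability matrix [B, AB, ..., A^(n-1) B].
blocks = [B]
for _ in range(n - 1):
    blocks.append(A @ blocks[-1])
ctrb = np.hstack(blocks)

ctrb_rank = np.linalg.matrix_rank(ctrb)
is_controllable = ctrb_rank == n   # full rank implies stabilizability
```

For these matrices the stacked input matrix already has full row rank, so the rank condition holds without any contribution from the powers of $A$.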


Cost Functions 


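Each player $i \in \mathcal{P}$ aims to minimize an individual quadratic performance index

\[
J_i = \int_0^{\infty} x^\top(t) Q_i x(t) + u_i^\top(t) R_{ii} u_i(t) \, \mathrm{d}t. \tag{7.21}
\]

The ground truth parameters of the cost functions were set to

\[
\begin{aligned}
Q_1^* &= \operatorname{diag}(1, 0.4, 2, 1), & R_{11}^* &= \operatorname{diag}(1, 1), \\
Q_2^* &= \operatorname{diag}(1, 0.6, 1, 2), & R_{22}^* &= \operatorname{diag}(1, 1), \\
Q_3^* &= \operatorname{diag}(1, 1, 0.5, 1), & R_{33}^* &= \operatorname{diag}(1, 2).
\end{aligned} \tag{7.22}
\]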


Using the ground truth cost function parameters, the feedback Nash equilibrium trajectories $x^*(t)$ and $u^*(t)$ were calculated by means of the coupled matrix Riccati equations [Eng05, Theorem 8.5]. The theorem allows the Nash character of the trajectories to be confirmed given the stability of the controlled system. The resulting Nash equilibrium feedback matrices are given by

\[
\begin{aligned}
K_1^* &= \begin{bmatrix} 0.012 & 0.123 & 0.114 & 0.318 \\ 0.066 & 0.028 & -0.006 & 0.012 \end{bmatrix}, \\
K_2^* &= \begin{bmatrix} 0.004 & -0.041 & 0.541 & 0.130 \\ \cdots & \cdots & 0.130 & 0.644 \end{bmatrix}, \\
K_3^* &= \begin{bmatrix} 0.025 & 0.650 & 0.115 & 0.149 \\ 0.020 & 0.132 & 0.384 & 0.301 \end{bmatrix}.
\end{aligned}
\]
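One common way to approach the coupled algebraic Riccati equations numerically is a best-response iteration: each player in turn solves a standard Riccati equation while the other players' feedback matrices are held fixed. The sketch below is an illustration under stated assumptions, not the procedure of [Eng05, Theorem 8.5]; the function names `solve_care` and `feedback_nash_lq` are hypothetical, the Riccati solver is a simple textbook method based on the Hamiltonian matrix, and convergence of the iteration is not guaranteed in general.

```python
import numpy as np


def solve_care(A, B, Q, R):
    """Stabilizing solution P of A'P + PA - P B R^-1 B' P + Q = 0.

    Uses the stable invariant subspace of the Hamiltonian matrix
    (textbook method; a library solver would be more robust).
    """
    n = A.shape[0]
    S = B @ np.linalg.solve(R, B.T)
    H = np.block([[A, -S], [-Q, -A.T]])
    eigvals, eigvecs = np.linalg.eig(H)
    stable = eigvecs[:, eigvals.real < 0]
    X1, X2 = stable[:n, :], stable[n:, :]
    P = np.real(X2 @ np.linalg.inv(X1))
    return 0.5 * (P + P.T)


def feedback_nash_lq(A, Bs, Qs, Rs, tol=1e-10, max_iter=500):
    """Best-response iteration for u_i = -K_i x; may fail to converge."""
    n = A.shape[0]
    Ks = [np.zeros((B.shape[1], n)) for B in Bs]
    for _ in range(max_iter):
        change = 0.0
        for i, (B, Q, R) in enumerate(zip(Bs, Qs, Rs)):
            # Dynamics seen by player i while the others keep their feedback.
            A_i = A - sum(Bs[j] @ Ks[j] for j in range(len(Bs)) if j != i)
            P = solve_care(A_i, B, Q, R)
            K_new = np.linalg.solve(R, B.T @ P)
            change = max(change, float(np.max(np.abs(K_new - Ks[i]))))
            Ks[i] = K_new
        if change < tol:
            return Ks
    raise RuntimeError("best-response iteration did not converge")


# Toy check: two-player scalar game dx/dt = u1 + u2 with q_i = r_i = 1.
# By symmetry, both gains approach 1/sqrt(3).
one = np.array([[1.0]])
K_toy = feedback_nash_lq(np.array([[0.0]]), [one, one], [one, one], [one, one])
```

To try it on the game of this section, the function would be called with $A$ and $B_i$ from (7.20) and with $Q_i^*$ and $R_{ii}^*$ from (7.22). Whether the iteration converges for these matrices, and whether it reproduces the $K_i^*$ given above, has to be checked by running it.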


Properties of the Inverse LQ Dynamic Game 


Before solving the inverse LQ dynamic game, the LQ character of the problem allows for its analysis by means of the results of Chapter 5. We first use the results of Lemma 5.2 to determine the matrices $M_i \in \mathbb{R}^{8 \times 6}$ with (5.14) using the control matrices $K_i$. Now consider the rank of $M_i$ and obtain $\operatorname{rank}(M_i) = 6$ for all $i \in \mathcal{P}$. By the results of Theorem 5.3, the necessary and sufficient conditions for a unique solution of the inverse LQ dynamic game up to a multiplying constant parameter are fulfilled.


7.5.2 Noisefree Case 


The inverse dynamic game methods are first tested under ideal conditions, i.e. the observed trajectories are free of measurement noise and therefore correspond exactly to the FNE which arises from the dynamic game consisting of the system dynamics (7.20) and cost functions (7.21) with ground truth parameters (7.22). Since both the IOC and IRL methods rely on the estimation of the Nash equilibrium feedback matrices, this is carried out for all players using the least-squares approach presented in Section 5.4.2 and the given trajectories $x^*(t)$, $u_i^*(t)$. The estimation yields very good results, as the deviations from the original Nash feedback matrices satisfy $\|\hat{K}_i - K_i^*\| < 10^{-4}$, $i \in \{1, 2, 3\}$.
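The idea of estimating a feedback matrix from sampled trajectories can be sketched with an ordinary least-squares fit of the relation $u_i = -K_i x$. This is a generic illustration assuming NumPy; it is not claimed to be identical to the approach of Section 5.4.2, and `estimate_feedback_matrix`, `K_true` and the synthetic data are hypothetical.

```python
import numpy as np


def estimate_feedback_matrix(X, U):
    """Least-squares estimate of K in u = -K x.

    X has shape (T, n) with one sampled state per row,
    U has shape (T, m) with the matching controls of one player.
    """
    # Solve X @ (-K).T ~= U in the least-squares sense.
    minus_k_t, *_ = np.linalg.lstsq(X, U, rcond=None)
    return -minus_k_t.T


# Synthetic, noise-free data generated with a made-up gain.
rng = np.random.default_rng(0)
K_true = np.array([[0.5, -0.2, 0.1, 0.0],
                   [0.0, 0.3, -0.4, 0.7]])
X = rng.standard_normal((200, 4))
U = -X @ K_true.T

K_hat = estimate_feedback_matrix(X, U)
deviation = np.max(np.abs(K_hat - K_true))
```

With noise-free and sufficiently exciting samples the fit recovers the gain up to numerical precision; with noisy observations the estimate degrades, which then propagates into the methods that rely on it.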


Inverse Optimal Control 


The inverse dynamic game is solved by determining the solution of the quadratic static optimization problem (5.33) using the estimated feedback matrices $\hat{K}_i$. The parameters in (7.22) are identified exactly up to two decimal places and are therefore not given explicitly. The mean parameter error $\Delta^{\theta}_{\mathrm{mean}}$ is 0.05% and the maximum parameter error $\Delta^{\theta}_{\max}$ is 0.26%. The NSAE of the states is $\delta^{x} = 0.002$ while the NSAE of the controls is $\delta^{u} = 0.010$.


Inverse Reinforcement Learning


The IRL approach leads to identified cost function parameters which approximate the original ground truth parameters up to two decimal places. The mean parameter error is $\Delta^{\theta}_{\mathrm{mean}} = 0.1\%$ and the maximum parameter error is $\Delta^{\theta}_{\max} = 0.6\%$. The NSAE of the states is $\delta^{x} = 0.019$. The NSAE of the controls is $\delta^{u} = 0.073$. All errors are slightly bigger than the errors obtained with the IOC method.


Direct Bilevel Approach 


The DB approach leads to a mean parameter error of $\Delta^{\theta}_{\mathrm{mean}} = 0.43\%$ and a maximum parameter error of $\Delta^{\theta}_{\max} = 3.85\%$. The NSAE of the states is $\delta^{x} = 0.028$ and the NSAE of the controls is $\delta^{u} = 0.151$. The DB approach yields greater errors than both the IOC and IRL methods.


Comparison 


The following Tables 7.12 and 7.13 summarize the results of the parameter identification with all methods.⁵²


Table 7.12: Ground truth and cost function matrices $Q_i$ identified from noiseless trajectories with all methods


Case   $Q_1$                      $Q_2$                      $Q_3$
GT     (1.00, 0.40, 2.00, 1.00)   (1.00, 0.60, 1.00, 2.00)   (1.00, 1.00, 0.50, 1.00)
IOC    (1.00, 0.40, 2.00, 1.00)   (1.00, 0.60, 1.00, 2.00)   (1.00, 1.00, 0.50, 1.00)
IRL    (1.00, 0.40, 2.00, 1.00)   (1.00, 0.60, 1.00, 2.00)   (0.99, 1.00, 0.50, 1.00)
DB     (1.00, 0.40, 2.00, 1.00)   (1.04, 0.58, 1.00, 2.00)   (0.99, 1.00, 0.50, 1.00)


Table 7.13: Ground truth and cost function matrices $R_{ii}$ identified from noiseless trajectories with all methods


Case   $R_{11,(22)}$   $R_{22,(22)}$   $R_{33,(22)}$
GT     1.00            1.00            2.00
IOC    1.00            1.00            2.00
IRL    1.00            1.00            2.00
DB     1.01            1.00            2.00
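Since cost function parameters can only be identified up to a common scaling factor, all results in Tables 7.12 and 7.13 are reported after normalization with respect to $R_{ii,(11)}$. The following sketch shows such a normalization together with a relative parameter error. The error definition (absolute deviation relative to the ground truth value, in percent) is an assumption, as the definition of $\Delta^{\theta}$ is not restated in this section, and `theta_raw` is hypothetical data built from the rounded DB values of player 2 scaled by an arbitrary factor of 2.5.

```python
def normalize(theta, ref_index):
    """Scale a parameter vector so that the entry at ref_index equals 1."""
    ref = theta[ref_index]
    return [value / ref for value in theta]


def relative_errors_percent(theta_hat, theta_true):
    """Assumed metric: |theta_hat - theta_true| / |theta_true| in percent."""
    return [100.0 * abs(h - t) / abs(t) for h, t in zip(theta_hat, theta_true)]


# Player 2, ordered as [Q(11), Q(22), Q(33), Q(44), R(11), R(22)].
theta_true = [1.00, 0.60, 1.00, 2.00, 1.00, 1.00]
# Hypothetical unnormalized estimate (rounded DB table values times 2.5).
theta_raw = [2.60, 1.45, 2.50, 5.00, 2.50, 2.50]

theta_hat = normalize(theta_raw, ref_index=4)   # divide by R(11)
errors = relative_errors_percent(theta_hat, theta_true)
mean_error = sum(errors) / len(errors)
max_error = max(errors)
```

Because the table values are rounded to two decimals, errors computed from them only approximate the values reported in the text.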


Even though the metrics show that the DB approach leads to the highest mean and maximum parameter errors as well as the highest NSAE, thus suggesting a superiority of IOC and IRL in the quality of the estimation, all errors are relatively small. The values in Tables 7.12 and 7.13 confirm that all methods lead to an excellent estimation of the cost function parameters. For the sake of completeness and in order to see potential differences in the approximation of the observed trajectories, we solve the LQ differential game with the estimated parameters and determine the corresponding FNE trajectories for all methods. The ground truth and model state trajectories are depicted in Figure 7.11. Likewise, the control trajectories are shown in Figure 7.12. All methods are able to perfectly approximate the observed trajectories.


⁵² All results were normalized with respect to the parameter $R_{ii,(11)}$ for a better comparison. Therefore, this parameter is not given explicitly in Table 7.13.


Figure 7.11: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method


Figure 7.12: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method


7.5.3 Robustness to Measurement Noise


This section presents simulation results on the influence of noise in the observed trajectories on the results of the inverse dynamic game methods. Similar to the evaluation in Section 7.4.3 for the open-loop case, Gaussian white noise is added to the state trajectories and the control trajectories according to (7.18) and (7.19), respectively. Once more, the added noise is generated such that the corrupted trajectories have a particular SNR value. The considered SNR values range from 20 dB to 40 dB. For each SNR value, 100 samples of Gaussian white noise were generated, which yields the noisy trajectories $\tilde{\zeta}_s$, $s \in \{1, \dots, 100\}$ (cf. Definition 6.2). Each one of these trajectories was used to identify cost function parameters. Therefore, we obtain for each method the parameter sets $\hat{\theta}_s$, $s \in \{1, \dots, 100\}$. Each of the parameter sets can be used to determine corresponding FNE trajectories, which are denoted by $\hat{\zeta}_s$, $s \in \{1, \dots, 100\}$. Analogously to Section 7.4.3, the metrics $\delta^{x}_{\mathrm{mean}}$ and $\delta^{u_i}_{\mathrm{mean}}$, $i \in \{1, 2, 3\}$, for the trajectory comparison as well as $\Delta^{\theta}_{\max}$ and $\Delta^{\theta}_{\mathrm{mean}}$ for the parameter comparison are considered.
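The generation of noisy trajectories with a prescribed SNR can be sketched as follows. This is a minimal illustration which assumes that the SNR is defined as the ratio of the mean signal power to the noise variance, $\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10}(P_{\mathrm{signal}} / \sigma^2)$; the exact definitions are those of (7.18) and (7.19), which are not restated here. The function names and the decaying test signal are hypothetical.

```python
import math
import random


def add_noise_for_snr(signal, snr_db, rng):
    """Return a copy of `signal` corrupted by zero-mean Gaussian white noise.

    Assumption: SNR_dB = 10*log10(P_signal / sigma^2), where P_signal is the
    mean squared value of the noise-free samples.
    """
    p_signal = sum(v * v for v in signal) / len(signal)
    sigma = math.sqrt(p_signal / 10.0 ** (snr_db / 10.0))
    return [v + rng.gauss(0.0, sigma) for v in signal]


def empirical_snr_db(clean, noisy):
    """SNR in dB measured from a clean signal and its noisy version."""
    p_signal = sum(c * c for c in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(p_signal / p_noise)


# 100 noisy realizations of one made-up, decaying trajectory at 20 dB.
clean = [math.exp(-0.01 * k) * math.cos(0.05 * k) for k in range(2000)]
rng = random.Random(0)
noisy_set = [add_noise_for_snr(clean, 20.0, rng) for _ in range(100)]
mean_snr = sum(empirical_snr_db(clean, y) for y in noisy_set) / len(noisy_set)
```

Each element of `noisy_set` would then be passed to an identification method, which gives one parameter set per realization and allows the mean and maximum errors to be computed over all 100 runs.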


Inverse Optimal Control 


The parameter and trajectory errors are given in Table 7.14. The errors increase moderately with lower values of the SNR. The worst-case mean parameter error is 18.2%. It is noticeable that the NSAE of the control $u_1(t)$ is always bigger than the NSAE of the other players’ controls.


Table 7.14: Parameter errors and NSAE between ground truth trajectories and trajectories obtained with IOC from 
noisy trajectories 


SNR in dB   $\delta^{x}_{\mathrm{mean}}$   $\delta^{u_1}_{\mathrm{mean}}$   $\delta^{u_2}_{\mathrm{mean}}$   $\delta^{u_3}_{\mathrm{mean}}$   $\Delta^{\theta}_{\max}$   $\Delta^{\theta}_{\mathrm{mean}}$
20          4.380                          8.082                            4.454                            4.837                            0.977                      0.182
25          1.668                          3.437                            1.614                            1.878                            0.843                      0.162
30          0.714                          1.168                            0.687                            0.842                            0.379                      0.080
35          0.330                          0.455                            0.337                            0.408                            0.342                      0.045
40          0.173                          0.257                            0.182                            0.233                            0.299                      0.017
∞           0.002                          0.003                            0.003                            0.004                            0.003                      $4.67 \cdot 10^{-4}$


Inverse Reinforcement Learning


The error measures for each SNR value are given in Table 7.15. In this case, it can be observed again that the NSAE of the control $u_1(t)$ is always bigger than the NSAE of the other players’ controls. The worst-case mean parameter error is 11.4%.


Table 7.15: Parameter errors and NSAE between ground truth trajectories and trajectories obtained with IRL from 
noisy trajectories 


SNR in dB   $\delta^{x}_{\mathrm{mean}}$   $\delta^{u_1}_{\mathrm{mean}}$   $\delta^{u_2}_{\mathrm{mean}}$   $\delta^{u_3}_{\mathrm{mean}}$   $\Delta^{\theta}_{\max}$   $\Delta^{\theta}_{\mathrm{mean}}$
20          1.556                          4.947                            1.449                            1.144                            1.859                      0.114
25          0.693                          2.136                            0.717                            0.579                            1.327                      0.061
30          0.291                          0.839                            0.361                            0.309                            0.631                      0.030
35          0.158                          0.475                            0.183                            0.181                            0.467                      0.018
40          0.085                          0.250                            0.098                            0.096                            0.175                      0.008
∞           0.019                          0.045                            0.011                            0.017                            0.006                      0.001


Direct Bilevel Approach 


Table 7.16 gives the resulting NSAE and the parameter errors. The trend of less accurate estimations of the control $u_1(t)$ is visible in this case as well. The worst-case mean parameter error is 19.4%.


Table 7.16: Parameter errors and NSAE between ground truth trajectories and trajectories obtained with the DB method from noisy trajectories


SNR in dB   $\delta^{x}_{\mathrm{mean}}$   $\delta^{u_1}_{\mathrm{mean}}$   $\delta^{u_2}_{\mathrm{mean}}$   $\delta^{u_3}_{\mathrm{mean}}$   $\Delta^{\theta}_{\max}$   $\Delta^{\theta}_{\mathrm{mean}}$
20          1.255                          3.640                            0.846                            3.843                            1.866                      0.194
25          0.699                          1.829                            0.504                            2.094                            0.647                      0.093
30          0.438                          1.049                            0.308                            1.328                            0.579                      0.061
35          0.266                          0.678                            0.192                            0.746                            9.615                      0.054
40          0.148                          0.334                            0.112                            0.430                            0.253                      0.022
∞           0.028                          0.062                            0.026                            0.063                            0.039                      0.004


Comparison


Figure 7.13 shows a comparison of the mean NSAE obtained with each method for all SNR values. It is noticeable that the IRL approach leads to the lowest NSAE of the states for SNR values of 30 dB to 40 dB. For an SNR of 25 dB, the IRL method and the DB approach obtain almost the same results. Finally, for highly corrupted trajectories with an SNR of 20 dB, the DB approach offers the best results, closely followed by the IRL method. For all SNR values, the IOC method leads to a higher state error than the other approaches. Similar results can be observed in the mean NSAE of the controls $\delta^{u_i}_{\mathrm{mean}}$, $i \in \{1, 2, 3\}$. In this case, the IRL approach offers better results consistently across all SNR values. For little noise, i.e. for SNR values of 30 dB to 40 dB, the IOC method leads to better results than the DB approach.


Figure 7.13: Mean NSAE obtained with each method for all trajectory SNR values


Regarding the parameter errors, Figure 7.14 shows that the lowest mean parameter error is obtained by the IRL method for all SNR values. However, the maximum parameter error does not show a clear trend, but suggests that the IOC method yields more consistent results, as the other methods have greater maximum parameter errors. The IOC method and the DB approach have similar mean parameter errors. However, by inspecting the maximum parameter error, it can be discerned that the IOC approach does not lead to great differences as the SNR value changes. In contrast, the maximum parameter error of the IRL method is always higher and varies considerably more, with the exception of the case of an SNR value of 40 dB. The DB method results do not allow a particular interpretation as no clear trend can be observed, except for the bigger error with lower SNR, which is common to all methods. Nevertheless, an outlier value can be observed for an SNR of 35 dB, caused by an anomalously poor identification result.


Once more, for a better understanding of these results, the mean values of the identified parameters were used to determine mean estimated Nash equilibrium trajectories. These parameters are listed in the Appendix: Tables E.1 and E.2 correspond to the IOC method, Tables E.3 and E.4 to the IRL method, and Tables E.5 and E.6 show the results of the DB approach. The resulting estimated FNE trajectories are compared with the original noiseless trajectories in $\zeta^*$. Figures 7.15 and 7.16 show this comparison for the FNE state and control trajectories, respectively, which were estimated from noisy observations with an SNR of 20 dB. It is noticeable that, despite the low SNR, all methods lead to good approximations of the state and control trajectories. On closer inspection, the agreement between the original and the estimated trajectories is better for the state variables. Furthermore, we can observe that the DB method performs slightly better than the IRL and IOC methods. While these minor differences are visible in this case, they are even smaller for greater SNR values. The corresponding figures are given in Section E.2.1 of the Appendix.


Figure 7.14: Parameter error of identification for all SNR values and all methods


Figure 7.15: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 20 dB.


Figure 7.16: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 20 dB.


7.5.4 Robustness to a Basis Function Mismatch 


This section presents an evaluation of the robustness of the inverse LQ dynamic game methods to a mismatch in the basis functions, similar to the analysis conducted in Section 7.4.4 for the open-loop case. The noisefree trajectories generated by the cost function matrices $Q_i^*$ and $R_{ij}^*$, $i, j \in \mathcal{P}$, as given in Section 7.5.1 are used for identification. For both the inverse dynamic game step and the subsequent forward solution to obtain estimated trajectories $x(t)$ and $u_i(t)$, $i \in \mathcal{P}$, it is assumed that certain elements of the matrix $Q_i$ are neglected and therefore not identified. The considered cases are described in Table 7.17. They correspond to an increasing number of neglected parameters of the diagonal matrix $Q_i$. Analogously to the open-loop case, only the NSAE of the trajectories is considered for the evaluation. The NSAE values arising from identification with each method are given in Table 7.18. Similar error values can be observed for cases I to III for all methods, with the DB method presenting slightly lower values. In turn, case IV shows a very high error for all methods. The observed and the estimated trajectories are shown by way of example for case I in Figures 7.17 and 7.18. Additional plots describing the results of the other cases can be found in Section E.2.2 of the Appendix.


Table 7.17: Considered cases in the basis function mismatch analysis of inverse LQ feedback dynamic games


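Case   $\theta_i$
GT     $[Q_{i,(11)} \;\; Q_{i,(22)} \;\; Q_{i,(33)} \;\; Q_{i,(44)} \;\; R_{ii,(11)} \;\; R_{ii,(22)}]$
I      $[Q_{i,(11)} \;\; Q_{i,(22)} \;\; Q_{i,(33)} \;\; 0 \;\; R_{ii,(11)} \;\; R_{ii,(22)}]$
II     $[Q_{i,(11)} \;\; Q_{i,(22)} \;\; 0 \;\; 0 \;\; R_{ii,(11)} \;\; R_{ii,(22)}]$
III    $[Q_{i,(11)} \;\; 0 \;\; 0 \;\; 0 \;\; R_{ii,(11)} \;\; R_{ii,(22)}]$
IV     $[0 \;\; 0 \;\; 0 \;\; 0 \;\; R_{ii,(11)} \;\; R_{ii,(22)}]$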


Table 7.18: NSAE in case of basis function mismatch in inverse LQ dynamic games 


Case   Method   $\delta^{x}$   $\delta^{u_1}$   $\delta^{u_2}$   $\delta^{u_3}$   $\delta^{u}$
I      IOC      37.622         19.944           76.219           30.861           127.024
       IRL      11.978         22.173           17.352           8.489            48.013
       DB       11.224         16.388           14.781           8.625            39.795
II     IOC      35.289         39.714           94.681           29.867           164.262
       IRL      15.659         21.337           28.352           17.377           67.065
       DB       13.497         20.699           27.027           21.121           68.848
III    IOC      29.423         61.954           88.718           22.088           172.759
       IRL      47.402         30.572           50.260           59.949           140.782
       DB       13.870         24.111           26.942           21.399           72.451
IV     IOC      438.410        49.741           72.837           100.270          222.848
       IRL      438.410        49.741           72.837           100.270          222.848
       DB       201.243        49.741           158.091          100.270          308.102


7.5.5 Discussion of Inverse LQ Dynamic Game Results


The inverse LQ differential game was solved by means of an IOC-based method, an IRL-based method and the DB approach. All methods were shown to lead to good identification results both in terms of trajectory approximation and parameter estimation. The IOC method presented the highest parameter estimation precision in the case of noiseless trajectories.


The analysis with noise-corrupted trajectories demonstrated that the IRL-based method offers the best results across all SNR values. Only for the mean NSAE of the states at an SNR of 20 dB is the DB method slightly better than IRL. The results indicate that the DB and IRL methods are more robust towards measurement noise than IOC. As for the parameter error, we observe that the mean parameter error reflects the fact that the IRL method performed best for all SNR values. The higher robustness of the DB approach in low SNR regions compared to IOC can also be noticed. However, an interesting result of IOC is the lower variability in the maximum parameter error. This suggests that even though the DB approach and IRL performed better in the mean, they are not guaranteed to always lead to better results.


Figure 7.17: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q_{i,(44)} = 0$ for $i \in \{1, 2\}$ (case I).


Regarding the robustness to a basis function mismatch, the results of Table 7.18 show that the methods are fairly robust to a mismatch caused by the omission of features. However, not including any basis function which penalizes the states (as in case IV) leads to major deviations of both states and controls with respect to the original trajectories. The original parameters describe a behavior which aims at regulating all states to zero, and this has to be considered in the choice of the basis functions. Similarly to the analysis of the effects of measurement noise on the results, it can be discerned that the IRL and DB methods are slightly more robust than the IOC method in case of a basis function mismatch. Finally, it can also be noted that the control trajectory approximation is corrupted more than the state approximation, especially for the IOC and IRL methods. In general, the approximations of the controls are affected more, independent of whether the perturbation lies in the basis functions or the trajectories.


Figure 7.18: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q_{i,(44)} = 0$ for $i \in \{1, 2\}$ (case I).


Analogously to the open-loop case, the results of this section can be explained by the different concepts behind each of the methods. IOC depends on the fact that the trajectories correspond to a feedback Nash equilibrium. The IRL method is based on the maximization of a likelihood function which indirectly includes the requirement of matching the costs of the observed trajectories and is therefore more robust towards mild violations of the Nash equilibrium assumption generated by the measurement noise or by basis function errors. Finally, the objective function of the DB approach, which explicitly considers the deviation between trajectories, is responsible for its good results.


7.6 Computation Time 


Before concluding on the observed results, the computation time of all approaches is briefly
examined, as computational efficiency is an important issue for the application of these
methods to an online estimation of cost function parameters. The computational effort is
shown by way of example for the case of noisy trajectories with SNR = 25 dB to give an
impression of the computational demands of each method. Table 7.19 presents the computation
times of the different methods for an identification in an open-loop and in a feedback
scenario.⁵³ The DB method requires the highest computation time, followed by the IRL
and IOC methods. The DB method's computation time in the open-loop nonlinear case
(2435.8 s) is approximately 35% higher than the one corresponding to the linear-quadratic
feedback dynamic game (1805.1 s). This can be explained by the fact that the former demands
the repeated solution of a nonlinear dynamic game, which is generally harder to solve than a
linear-quadratic dynamic game. The IOC method is the fastest since it relies on the solution
of a conventional RDE or a quadratic program, which can usually be solved efficiently with
numerical techniques. Finally, the IRL method lies in between. The conceptually abstract
likelihood function and its convergence properties are hard to analyze. However, the fact
that it consists of one single static optimization problem makes it likely to be faster than
the DB method.


Table 7.19: Computation times for inverse dynamic games

Method   T_CPU in s (OL)   T_CPU in s (FB)
IOC      4.2               0.087
IRL      161.3             1060.2
DB       2435.8            1805.1
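
The ratios between the computation times discussed in this section can be recomputed directly from Table 7.19. The following lines are only an illustrative sketch; the names `t_cpu` and `ratio` are chosen here and do not stem from the implementation used for the evaluation.

```python
# Computation times from Table 7.19 in seconds (OL: open-loop, FB: feedback).
t_cpu = {
    "IOC": {"OL": 4.2, "FB": 0.087},
    "IRL": {"OL": 161.3, "FB": 1060.2},
    "DB": {"OL": 2435.8, "FB": 1805.1},
}


def ratio(method_a, method_b, case):
    """Return the computation time of method_a divided by that of method_b."""
    return t_cpu[method_a][case] / t_cpu[method_b][case]


# Relative difference of the two DB times, expressed in both directions.
db_ol_vs_fb = t_cpu["DB"]["OL"] / t_cpu["DB"]["FB"] - 1.0  # OL relative to FB
db_fb_vs_ol = 1.0 - t_cpu["DB"]["FB"] / t_cpu["DB"]["OL"]  # FB relative to OL

for case in ("OL", "FB"):
    print(case,
          "DB/IOC = %.0f" % ratio("DB", "IOC", case),
          "DB/IRL = %.1f" % ratio("DB", "IRL", case))
print("DB: OL is %.0f %% above FB, FB is %.0f %% below OL"
      % (100 * db_ol_vs_fb, 100 * db_fb_vs_ol))
```

Note that a relative difference depends on which of the two values is taken as the reference, which is why both directions are computed.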


7.7 Conclusion 


In this chapter, a systematic comparison between IOC, IRL and DB methods for solving in- 
verse dynamic games was conducted. Both open-loop and feedback structures were consid- 
ered. Moreover, the robustness of the approaches with respect to the presence of noise in 
the observed trajectories was examined. In addition to the quality of cost function parameter 
identification, the capability of the identified cost functions to describe observed data was 
also assessed. 


In the open-loop case, the IOC method was shown to lead to the most accurate results in
the parameter estimation if the observed trajectories correspond to a Nash equilibrium.
Nevertheless, if the observations are noise-corrupted, the IOC method's results deteriorate.
The state trajectory approximation is still adequate, but the control trajectories deviate
considerably from the ground truth. The IRL and DB methods showed a higher robustness to
measurement noise and yielded similar results. Only for the lowest SNR value considered did
the DB method lead to slightly better approximations. In addition, all methods show a slight
robustness to missing relevant basis functions as long as the other ones are meaningful and
related to the control task at hand. If an inadequate basis function is provided, only
the IRL and DB methods are able to neglect it by setting its corresponding parameter to a
value near zero.


53 The CPU used was an Intel Xeon E5-2630 at 2.6 GHz with 32 GB of RAM.

As for the feedback case, a trend similar to the open-loop case could be observed.
Nevertheless, the magnitude of both the parameter error and the NSAE for the IOC and IRL
methods is smaller than in the open-loop case. One possible reason is that the linear system
dynamics allow for better identification, especially in the case of the IRL method, which
relies on a dynamics linearization (which is nevertheless time-variant, i.e. it is computed in
every time step). However, this may be best explained by the LS estimation of the feedback
matrices $K_i$, which is done by means of the control and state trajectories. This estimation
is, theoretically speaking, not bias-free: the noise has zero mean, but is applied to both the
control and the state values. In spite of this fact, the estimation works well in practice
such that a relatively accurate functional relationship between the states and the control is
provided to the IOC and IRL methods. This is also reflected by the good results obtained by
all methods in the analysis of basis function mismatch.


To finish this chapter, the main findings are summarized as follows: 


• Approaches based on IOC offer the most precise parameter identification results in
case of uncorrupted observations of Nash equilibrium trajectories. They are less robust
towards measurement noise than the other methods and may be affected by a
significant mismatch in the basis functions, but are the least computationally expensive
of all methods. The latter property indicates that this method class is the most
appropriate for a potential online application.


• Approaches based on IRL provide a good compromise between computation time and
quality of identification. They are the most robust towards measurement noise among
all tested methods. Moreover, they are more robust to inadequate basis functions
than the IOC method and yield results similar to those of the DB method in this case.


• The direct bilevel approach has been shown to lead to very good results and to be robust
to noise and slight errors in the basis functions, but its computation time is greater than
that of the IOC methods (up to a factor of approximately 20 000) and of the IRL methods
(up to a factor of approximately 15). It is therefore the least efficient among all methods.


• The robustness of all methods to measurement noise and to errors in the basis function
selection is higher for the state trajectory approximations. Especially for the IOC and
IRL methods, the approximation of the controls is more sensitive to violations of the
assumptions the methods are based on.


After this analysis of inverse dynamic game methods in a simulation environment, the fol- 
lowing chapter presents a first application of inverse dynamic game methods with real ex- 
perimental data. 


8 Application to Shared Control Systems 


This chapter presents an application example for inverse dynamic games. The aim of this 
chapter is to provide a first evaluation of the applicability of inverse dynamic games to iden- 
tify cost functions in a real scenario. In the following, a shared control scenario between 
two humans is considered. Shared control stems from the field of human-machine coopera- 
tion. It usually describes a situation where humans and machines simultaneously control a 
dynamic system. Therefore, it has led to a rising number of applications including robot- 
assisted rehabilitation in medicine as well as all kinds of technical assistance systems for 
vehicle control or for various types of technical devices including construction machines, 
wheelchairs, etc. For the evaluations in this chapter, an experiment in which several pairs 
of subjects simultaneously control a steering system is employed. This scenario is modeled 
by means of a differential game such that cost functions describing the interaction of human 
pairs can be identified from measured data. The two method classes for inverse differential 
games presented in this thesis, IOC and IRL, shall be evaluated by means of this experiment. 
Furthermore, similar to Chapter 7, the results shall be compared to the results of applying 
the DB approach for identification. 


8.1 Experimental Setup 


The experimental setup which was used can be seen as a simplified scenario of the lateral
control of a vehicle. This section presents all details concerning the hardware setup and the
implementation of the haptic feedback. In the following, this setup will be referred to as the
cooperative steering system.⁵⁵


54 The reader is referred to [ACM*18] for a formal definition of Shared Control and its multiple applications.
55 The experiment described in this chapter has also been presented in the conference paper [IFH19], where the
differential game model was shown to better explain cooperative steering behavior than an alternative
state-of-the-art model (presented in [IEFH18]).


The cooperative steering system consists of four main components: two active steering wheels,
two monitors with visualization windows and a real-time processing unit of dSPACE. The
steering wheels are equipped with an incremental encoder of 40000 increments per full
rotation for measuring the steering angles with a sampling frequency of $f_s = 100$ Hz.
Furthermore, they are active due to integrated motors which can apply a torque on each of
them. The maximum torque of the motors is 15.6 Nm. One of the components of the motor torque
is calculated such that the steering wheel has the dynamics of a spring-damper system.
Therefore, the dynamics of the steering wheel $j \in \{1,2\}$ are described by means of the equation


$$\Theta_{\mathrm{sw},j}\,\ddot{\varphi}_j(t) = M_j(t) - d_j \dot{\varphi}_j(t) - c_j \varphi_j(t), \tag{8.1}$$


with the spring constant $c_j$, damping constant $d_j$ and the moment of inertia $\Theta_{\mathrm{sw},j}$, and where
$\varphi_j(t)$ and $M_j(t)$ denote the steering wheel angle and the human input torque, respectively. The
parameters of the steering wheels are given in Section F.1.1 of the Appendix.


In the experiment, the two steering wheels are haptically coupled. This virtual coupling 
is implemented in a real-time environment with the dSPACE processor unit. This unit is 
also used to establish the communication between all components. The haptic coupling is 
effectuated by calculating the required torque Mc(t) such that the angular difference between 
the two steering wheels is reduced to zero. This is achieved by emulating a virtual spring- 
damper element between both steering wheels with an automatic controller. Therefore, with 
the haptic coupling, a further torque exists which influences the dynamics of each steering 
wheel, leading to the dynamics equation 


$$\Theta_{\mathrm{sw},j}\,\ddot{\varphi}_j(t) = M_j(t) - d_j \dot{\varphi}_j(t) - c_j \varphi_j(t) + M_c(t). \tag{8.2}$$


The implementation of the controller was done in MATLAB/Simulink 2010b. Further
details on this controller can be found in Section F.1.2 of the Appendix.


A computer interacts with the real-time system and generates two separate visualization win- 
dows on two monitors in order to give visual feedback of the current steering wheel position 
to each participant. This visualization was implemented by means of OpenGL and includes 
a marker (green square) which moves horizontally in the window according to the value 
of the steering angle. The steering wheel value range which is mapped onto the screen is 
[-180°; 180°], where a positive angle corresponds to a counterclockwise rotation. A further 
element in the visualization window is the reference trajectory. The points which consti- 
tute the trajectory pass downwards through the window at a constant speed. A single point 
crosses the entire visualization window in 2 seconds. The vertical position of the marker is 
fixed at 75% of the window height. Figure 8.1 depicts all components of the experimental 
setup as well as an example of the visualization window and the black curtain (thick black 
line) which served to separate each subject’s area. 


8.2 Modeling 


The experiment consists of a shared control task, in which pairs of participants control the
cooperative steering system simultaneously. The aim of the subjects is to follow the
reference trajectory shown on the monitor by means of their corresponding steering wheel. This
scenario is modeled by means of a differential game such that the observed data can be used
to identify cost functions of each subject which explain their cooperative behavior. In the
following, the differential game is formalized mathematically. Afterwards, the system dynamic
equations and cost function structure are stated more precisely for the scenario at hand.


Figure 8.1: Hardware setup for the experiment


8.2.1 Shared Control Modeling via Differential Games 


Consider two human players controlling a dynamic system

$$\dot{x}(t) = A x(t) + B_1 u_1(t) + B_2 u_2(t) \tag{8.3}$$

with $x(0) = x_0$, where $x(t) \in \mathbb{R}^n$ represents the system states and $u_i(t) \in \mathbb{R}^{m_i}$ denotes the
control trajectories of player $i$. In addition, suppose a reference signal is given, which is the
output of the known linear reference model


$$\dot{z}(t) = H z(t). \tag{8.4}$$


Given that the framework of feedback control is the most suitable for modeling human motor
control [TJ02, Tod04], it is assumed that the human players select a feedback strategy $\gamma_i \in \Gamma_i^{\mathrm{FB}}$
according to Definition 3.6. Furthermore, the cost function structure


$$J_i = \int_0^{\infty} e(t)^\top Q_i\, e(t) + u_i(t)^\top R_i\, u_i(t) \,\mathrm{d}t, \quad i \in \{1,2\} \tag{8.5}$$


is assumed for each player, where $e(t) = x(t) - z(t)$. In this way, the cost function models
the objective of both humans to track a given reference, i.e. to minimize the error between the
state and reference trajectories.


While the cost function (8.5) is quadratic, it is not a standard quadratic cost function since
the cost function matrix $Q_i$ does not penalize the state variable $x(t)$, but the state-reference
deviation $e(t)$. Therefore, the methods for inverse linear-quadratic dynamic games cannot be
applied directly. Nevertheless, it is possible to introduce a new system state including both the
states and the reference variables such that (8.5) is transformed into a standard quadratic cost
function. This leads to extended system dynamics where the linearity property is maintained.
In this way, we obtain a linear-quadratic differential game according to Definition 3.11. The
details on these reformulations are presented in Section B.7 of the Appendix.
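
One common way to carry out such a reformulation is sketched below. It assumes that the reference model (8.4) is an autonomous linear system $\dot{z}(t) = H z(t)$ of the same dimension as $x(t)$; the symbols $\tilde{x}$, $\tilde{A}$, $\tilde{B}_i$ and $\tilde{Q}_i$ are introduced here for illustration only, and the construction in Section B.7 may differ in its details.

```latex
\tilde{x}(t) = \begin{bmatrix} x(t) \\ z(t) \end{bmatrix}, \qquad
\dot{\tilde{x}}(t) =
\underbrace{\begin{bmatrix} A & 0 \\ 0 & H \end{bmatrix}}_{\tilde{A}} \tilde{x}(t)
+ \underbrace{\begin{bmatrix} B_1 \\ 0 \end{bmatrix}}_{\tilde{B}_1} u_1(t)
+ \underbrace{\begin{bmatrix} B_2 \\ 0 \end{bmatrix}}_{\tilde{B}_2} u_2(t),
\\[1ex]
e(t) = x(t) - z(t) = \begin{bmatrix} I & -I \end{bmatrix} \tilde{x}(t)
\quad\Longrightarrow\quad
e(t)^\top Q_i\, e(t) = \tilde{x}(t)^\top
\underbrace{\begin{bmatrix} Q_i & -Q_i \\ -Q_i & Q_i \end{bmatrix}}_{\tilde{Q}_i}
\tilde{x}(t).
```

With these definitions, the running cost in (8.5) takes the standard form $\tilde{x}(t)^\top \tilde{Q}_i \tilde{x}(t) + u_i(t)^\top R_i u_i(t)$, while the extended dynamics remain linear.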


8.2.2 Cooperative Steering System Dynamics 


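To simplify the model of the cooperative steering system, an ideal coupling of the two steering
wheels is assumed. This means that both steering wheels have the same angle $\varphi$ and angular
velocity $\dot{\varphi}$. With this assumption, the dynamics of the system of coupled steering wheels are
given by


$$\dot{x}(t) = \begin{bmatrix} -\frac{d_e}{\Theta_{\mathrm{sum}}} & -\frac{c_e}{\Theta_{\mathrm{sum}}} \\ 1 & 0 \end{bmatrix} x(t) + \begin{bmatrix} \frac{1}{\Theta_{\mathrm{sum}}} \\ 0 \end{bmatrix} u_1(t) + \begin{bmatrix} \frac{1}{\Theta_{\mathrm{sum}}} \\ 0 \end{bmatrix} u_2(t) \tag{8.6}$$


where $x(t) = [\dot{\varphi}(t) \;\; \varphi(t)]^\top$ and $u_i(t) = M_i(t)$ is the steering torque of human $i$. The
variable $\Theta_{\mathrm{sum}}$ denotes the sum of the moments of inertia of both steering wheels. All system
parameters are given in Table 8.1.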


Table 8.1: Cooperative steering system model parameters


Parameter                  Value             Description
$\Theta_{\mathrm{sum}}$    0.094 kg m²       Rotational inertia of the coupled steering wheels
$c_e$                      1.146 N m/rad     Spring constant
$d_e$                      0.859 N m s/rad   Damping constant
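
To give a feeling for the model (8.6) with the parameters of Table 8.1, the following sketch integrates the coupled dynamics for constant input torques. It is a minimal illustration only: the explicit Euler scheme, the step size and the torque values are assumptions made here, and the names `THETA_SUM`, `C_E`, `D_E`, `dynamics` and `simulate` are hypothetical.

```python
# Parameters from Table 8.1.
THETA_SUM = 0.094  # rotational inertia in kg m^2
C_E = 1.146        # spring constant in N m/rad
D_E = 0.859        # damping constant in N m s/rad


def dynamics(x, u1, u2):
    """Right-hand side of (8.6) for the state x = [angular velocity, angle]."""
    omega, phi = x
    omega_dot = (-D_E * omega - C_E * phi + u1 + u2) / THETA_SUM
    return [omega_dot, omega]


def simulate(u1, u2, t_end=10.0, dt=1e-3):
    """Integrate (8.6) from rest with constant torques (explicit Euler)."""
    x = [0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        dx = dynamics(x, u1, u2)
        x = [x[0] + dt * dx[0], x[1] + dt * dx[1]]
    return x


# Hypothetical constant torques of both players in N m.
omega_final, phi_final = simulate(u1=0.3, u2=0.2)
print(phi_final, (0.3 + 0.2) / C_E)
```

For constant torques, the steering angle settles where the spring torque balances the sum of both inputs, i.e. at $(u_1 + u_2)/c_e$, which the two printed values allow to compare.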


8.2.3 Cost Functions 

The cost function structure is given by (8.5). Furthermore, diagonal matrices $Q_i = \mathrm{diag}(q_i^{(1)}, q_i^{(2)})$
are assumed such that off-diagonal parameters are neglected. This is a common procedure
in optimal control theory since off-diagonal matrix elements represent mixed terms in the
cost function which are usually not interpretable [BH75]. The state reference is given by
$z(t) = [\dot{\varphi}_{\mathrm{ref}}(t) \;\; \varphi_{\mathrm{ref}}(t)]^\top$, representing the reference values for the steering angle velocity
and the steering angle, which is visible on the monitor. It is assumed that the participants
do not aim to follow a particular reference trajectory of the steering velocity since none was
specified, neither visually nor verbally. Conversely, the reference trajectory of the steering
angle $\varphi_{\mathrm{ref}}(t)$ corresponds to the one visible on the monitor and is equal for both
participants.


8.3 Data Acquisition and Preparation 


In order to apply inverse dynamic game methods, a set of state and control trajectories is
needed. As mentioned previously in Section 8.1, a sensor for measuring the angle $\varphi_i(t)$ of
each steering wheel is available. The steering angle velocity $\dot{\varphi}_i(t)$ and the acceleration $\ddot{\varphi}_i(t)$
are determined offline by a numerical differentiation and a subsequent smoothing process via
a cubic spline interpolation (MATLAB function csaps with parameter p = 0.99995). The
steering torque of each human $u_i(t) = M_i(t)$ is then calculated by means of (8.2), i.e. the
system dynamics equation of each steering wheel. Due to the ideal coupling of the steering
wheels, the steering wheel angle $\varphi(t)$ and angular velocity $\dot{\varphi}(t)$ of the cooperative steering
system are set equal to the mean value of both steering wheel angles and velocities,
respectively.
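
The reconstruction of the human torque from the measured angle can be sketched as follows. This is a simplified illustration and not the procedure used for the evaluation: central finite differences replace the differentiation step, the spline smoothing is omitted, and the steering wheel parameters passed to `human_torque` are placeholders (the actual values are given in the Appendix). The names `central_diff` and `human_torque` are hypothetical.

```python
import math


def central_diff(values, dt):
    """First derivative by central differences (one-sided at both ends)."""
    n = len(values)
    deriv = [0.0] * n
    for k in range(1, n - 1):
        deriv[k] = (values[k + 1] - values[k - 1]) / (2.0 * dt)
    deriv[0] = (values[1] - values[0]) / dt
    deriv[-1] = (values[-1] - values[-2]) / dt
    return deriv


def human_torque(phi, m_c, dt, theta_sw, d, c):
    """Solve (8.2) for the human torque, given the angle and the coupling torque."""
    phi_dot = central_diff(phi, dt)
    phi_ddot = central_diff(phi_dot, dt)
    return [theta_sw * a + d * v + c * p - mc
            for a, v, p, mc in zip(phi_ddot, phi_dot, phi, m_c)]


# Synthetic angle signal sampled at 100 Hz; all parameter values are placeholders.
dt = 0.01
t = [k * dt for k in range(400)]
phi = [math.sin(tk) for tk in t]
m_c = [0.0] * len(t)
torque = human_torque(phi, m_c, dt, theta_sw=0.05, d=0.4, c=0.6)
```

With measured data, differentiating twice amplifies the sensor noise considerably, which is the reason a smoothing step such as the spline fit mentioned above is needed in practice.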


8.4 Experimental Protocol 


Fifty-two subjects (age 25 ± 2.27 years) participated in the experiment in pairs. They did not have
the possibility to make any eye contact and were told to refrain from speaking during the
experiment. They were aware that they were completing the task with a partner. Each subject
pair was told to track the reference trajectory as well as they could.


Each pair of subjects completed a run of approximately two minutes, which consisted of


• an initial part (P1) of approximately one minute, which allowed the participants to
become familiar with the haptically coupled system,


• a middle part (P2) of 4 seconds, which was used for identification and validation,


• a final part (P3) of 32 seconds, which was not used for analysis.


The first part P1 included splines and step functions as visible reference trajectories for
the steering angle. On the other hand, P2 consisted of only step functions. Step functions
were used for evaluation since these represent goal-oriented or point-to-point movements,
also known as reaching movements. This kind of movement is often considered in studies
concerning human motor behavior both from a neuroscience and biology perspective
[FH91, Kal09, KM11] as well as from a control theoretical perspective [ARARU*11, CS17].
The reference trajectory of P2 describes 4 point-to-point movements defined by the fixed
positions (120°, 0°, -120°, 0°, 120°). Finally, P3 included, similarly to P1, step functions as well as
splines. The subjects were unaware of this scenario subdivision and all related details.


8.5 Evaluation Procedure 


As described in Section 8.2.1, the shared control scenario is modeled as a linear-quadratic 
differential game with feedback strategies. Therefore, the methods for inverse feedback dy- 
namic games (the same as in Section 7.5) are applied for cost function identification. In the 
following, they are also referred to as the IOC, IRL and DB methods. All methods were given 
the same system dynamics and cost function structure. The data obtained from the middle 
part of the test run (P2) was used for estimating the cost function parameters of both partic- 
ipants with each of the aforementioned methods. 


Contrary to the simulations presented in Chapter 7, no ground truth cost function parameters
$\theta^* = (\theta_1^*, \theta_2^*)$ are available in a real application. Therefore, the only way to evaluate
the identification results is by using the estimated cost functions to generate estimated
trajectories $\hat{x}(t)$, $\hat{u}_1(t)$ and $\hat{u}_2(t)$ and compare them with the measured trajectories $x(t)$, $u_1(t)$
and $u_2(t)$. This comparison is done by means of the NSAE for states and controls introduced
in Section 7.3.2. The 52 participants formed 26 pairs of subjects and therefore 26 data sets
were available for analysis. Each of these 26 sets of trajectories leads to an estimation of the
cost function parameters. Therefore, we obtain the parameters $\hat{\theta}^{(s)}$, $s \in \{1,\dots,26\}$, for each of the
methods IOC, IRL and DB. Afterwards, each set of identified parameter vectors, consisting of
$\hat{\theta}^{(s)}_{\mathrm{IOC}}$, $\hat{\theta}^{(s)}_{\mathrm{IRL}}$ and $\hat{\theta}^{(s)}_{\mathrm{DB}}$, is used to solve for the Nash equilibrium trajectories $\hat{x}^{(s)}(t)$, $\hat{u}_1^{(s)}(t)$ and
$\hat{u}_2^{(s)}(t)$, $s \in \{1,\dots,26\}$. This is done by applying the reformulations of Section B.7 to obtain a
standard LQ differential game and using Theorem 3.7 afterwards. The Nash trajectories are
compared to the observed trajectories $x(t)$, $u_1(t)$ and $u_2(t)$ by computing the corresponding
NSAE as described in Section 7.3.2. Figure 8.2 summarizes the evaluation procedure applied
in this chapter.


8.6 Results 


The NSAE of states and controls was calculated for all data sets and all corresponding
identification results. All values are given in Section F.2 of the Appendix. Due to the small
data set, the median values $\delta_{\mathrm{median}}$ of the errors are considered instead of the mean values.
The median values and the standard deviations $\delta_{\mathrm{SD}}$ of the errors for all used inverse dynamic
game methods are given in Table 8.2. The statistical results are summarized and depicted in
Figure 8.3.


[Flowchart: measured trajectories $x(t)$, $u_i(t)$ → Inverse Dynamic Game → Calculate equilibrium → estimated trajectories $\hat{x}(t)$, $\hat{u}_i(t)$ → Calculation of trajectory deviation → $\delta^x$, $\delta^u$]


Figure 8.2: Evaluation procedure for the identification in a real shared control scenario


Table 8.2: Median and standard deviation of the NSAE obtained from identification with the IOC, IRL and DB methods


Method   $\delta^x_{\mathrm{median}}$   $\delta^x_{\mathrm{SD}}$   $\delta^u_{\mathrm{median}}$   $\delta^u_{\mathrm{SD}}$
IOC      127.429                        58.632                     166.578                        52.904
IRL      101.236                        35.611                     173.202                        49.356
DB       89.672                         19.372                     143.952                        22.867


The first noticeable characteristic of the results is the considerably higher magnitude of the
error compared to the magnitudes seen in Chapter 7. In general, it can be discerned that the
DB approach led to smaller mean values and variances of errors than the IRL and IOC based
approaches. The IRL method performed better than the IOC method in terms of the state
trajectory approximation. Nevertheless, the mean values of the NSAE of the controls are
very similar. The range and standard deviation of the errors shown in Figure 8.3 are smaller
for the DB method compared to the IOC and IRL based approaches. In order to test the statistical
significance of the differences between these errors, a Wilcoxon signed rank test⁵⁶ was
conducted on the data sets of $\delta^x$ and $\delta^u$. The test results confirmed that all differences
are statistically significant at a significance level of $\alpha = 0.01$, with one exception: for the
control errors of the IOC and IRL methods, the signed rank test indicated that the difference
is not statistically significant. Detailed results with p-values are provided in Section F.2.1
of the Appendix.


56 A Wilcoxon signed rank test (see e.g. [SC88]) is a statistical test where, contrary to more widespread statistical
test methods such as Student's t-test, it is not assumed that the data follow a normal distribution. This assumption
was avoided here due to the relatively small data population.
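
The principle of the signed rank test for paired samples can be sketched as follows. This is a textbook-style illustration and not the implementation used for the results in the Appendix: the p-value relies on a normal approximation without continuity or tie correction, the function name `wilcoxon_signed_rank` is hypothetical, and the error values in the example are made up.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus, two-sided p-value via normal approximation)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    k = 0
    while k < n:
        # Assign the average rank to a group of tied absolute differences.
        j = k
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[k]]):
            j += 1
        avg_rank = 0.5 * (k + j) + 1.0
        for m in range(k, j + 1):
            ranks[order[m]] = avg_rank
        k = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, w_minus, p_value


# Made-up paired errors of two methods on the same five data sets.
errors_a = [10.0, 12.0, 9.0, 15.0, 11.0]
errors_b = [8.0, 13.0, 6.0, 10.0, 11.5]
w_plus, w_minus, p_value = wilcoxon_signed_rank(errors_a, errors_b)
print(w_plus, w_minus, p_value)
```

The test is paired because every method is applied to the same 26 data sets, so only the signs and ranks of the per-pair differences enter the statistic. For small samples, exact tables of the statistic are preferable to the normal approximation used in this sketch.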
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Figure 8.3: Statistical results of the cost function identification in the experiment 


In order to further illustrate the identification results, the measured data and the estimated
trajectories $\hat{x}^{(s)}(t)$ and $\hat{u}_i^{(s)}(t)$ for some representative subject pairs $s \in \{1,\dots,26\}$ are shown
in the following. Figure 8.4 shows the data and identification results of subject pair 1. This
data set yielded the smallest error for all methods. It can be recognized that the states are
approximated the best by the DB approach, followed by the IRL method. The control
trajectories cannot be exactly described by the dynamic game with the estimated parameters $\hat{\theta}$
of any method. Only the qualitative course can be described and several changes in the torque
cannot be accounted for.


The following identification result in Figure 8.5 corresponds to subject pair 2. The DB and
IRL methods yield the best results regarding state trajectory approximation. Nevertheless, the
error is higher than in the results shown in Figure 8.4. In the case of the control trajectories, it
is noticeable that the IRL approach fails to identify the control actions of the first subject, but
estimates the control of the second subject as higher. This leads to the same state trajectories
as the DB approach. The estimation of a control trajectory as (nearly) a constant is an effect
which was observed for some data sets, not only for the IRL method, but also for the IOC and
DB methods. This effect can be seen e.g. in the results of subject pair 22 depicted in Figure 8.6.
The DB approach is able to describe the control trajectories better, but on the other hand, the
IOC and IRL methods are able to approximate the state trajectories slightly better than the
DB method for this data set.


8.7 Computation Time 


Analogously to Chapter 7, the computation time required for the solution of inverse dynamic
games is analyzed.⁵⁷ The mean of the computation times was calculated for each of the
method classes considered. The values are listed in Table 8.3. It can be observed that the
results of Section 7.6 are replicated. The DB approach needs the most computation time,
followed by the IRL and IOC methods. The IOC and IRL approaches need 0.01 % and 1.57 %
of the DB method's required computation time, respectively.


57 The CPU used was an Intel Core i7-6600U at 2.6 GHz with 12 GB of RAM.


Figure 8.4: Identification results of subject pair 1


Table 8.3: Mean computation time for identification of both cost functions of a subject pair in the cooperative
steering experiment.


Method   T_CPU
IOC      0.04 s
IRL      4.6 s
DB       291.75 s


Figure 8.5: Identification results of subject pair 2


8.8 Discussion 


This section is devoted to a discussion of the results of the previous sections. The results are 
analyzed and the limitations of the methods and the experiment are reviewed. 


Overall, it can be stated that the inverse feedback dynamic game method based on the DB
approach performs better than its IRL and IOC based counterparts in terms of trajectory
approximation. This is shown by the mean values of the errors $\delta_{\mathrm{DB,mean}} < \delta_{\mathrm{IRL,mean}} < \delta_{\mathrm{IOC,mean}}$
of both states and controls in Table 8.2. Furthermore, the standard deviations $\delta^x_{\mathrm{SD}}$ and $\delta^u_{\mathrm{SD}}$ are
the smallest for the DB approach, indicating that this method led to more consistent results.


The better results of the DB approach can be explained similarly to the simulation results of
Chapter 7. The underlying optimization problem in the DB method directly minimizes the
error between observed and estimated trajectories. In turn, the IRL method does this indirectly
by means of an implicit requirement included in the likelihood function. In a very different
approach, the IOC method aims to minimize the violation of Nash equilibrium conditions and
does not consider the error between trajectories in the process.


In general terms, the methods appear to be able to describe the state trajectories better than
the control trajectories. However, there were several data sets for which the state trajectories
could not be explained adequately by the cost functions with identified parameters, regardless
of the selected inverse dynamic game method. This raises the question of what might cause
this effect.


One potential source of error is an inexact modeling of the cooperative steering system. In 
particular, the assumption of an ideal coupling of the steering wheels may have been too 
strong for the used system, such that the description by means of (8.6) is not accurate enough. 
It is conceivable that this inaccuracy is higher the more dynamic the interaction is, i.e. when 
the partners act very differently and change the direction of the torque very often. Besides 
this fact, the subject pairs were observed to have partially disobeyed the instructions of the 
experiment. For example, in Figure 8.6, the time span between 1 s and 2 s shows that player 
1 applied a torque contrary to the one which is needed to bring the steering angle towards 
the reference value. This behavior had to be compensated for by player 2. Such behavior 
contradicts the rationality implied by a model based on differential games and thus cannot 
be accounted for. 


Overall, the results suggest that the players may not act exactly optimally and thus the
interaction may sometimes not be exactly represented by a Nash equilibrium. If the trajectories do
not represent a Nash equilibrium, then worse results of the IOC and IRL methods are
potentially obtained, given the fact that they rely on the estimation of a Nash equilibrium control
law from these trajectories. For example, the IOC method first calculates an estimation $\hat{K}_i$
of the linear control law which best describes the relation between measured controls and
states; afterwards, cost function parameters are determined which correspond to the
identified control matrix. However, these control matrices $\hat{K}_i$, which are optimal in a least-squares
sense (cf. Section 5.4.2), do not necessarily correspond to a Nash equilibrium. Consequently,
the cost functions with parameters $\hat{\theta}_i$ describe a Nash equilibrium which is the "closest" to $\hat{K}_i$
in the sense that the violation of the Riccati equations is minimal. To illustrate this, consider
the value of the residual $\|M_i \hat{\theta}_i\|$, where $M_i$ is calculated by means of the $\hat{K} = (\hat{K}_1, \dots, \hat{K}_N)$
identified via the LS method (see (5.36)). This describes the extent to which the identified
parameters $\hat{\theta}_i$, together with $\hat{K}_i$, violate the necessary and sufficient conditions for Nash
equilibria. Therefore, it can be seen as a measure of the "non-Nash" character of the estimated
$\hat{K}$.⁵⁸ Figure 8.7 shows that some of the identified $\hat{K}$ are approximately a Nash equilibrium, but
others present less Nash character. In particular, the good results of Figure 8.4 can be
associated with a low value of the residual. Nevertheless, it could be observed that the residual
value does not allow foreseeing the quality of the trajectory approximation results.


58 Note that $\|M_i \hat{\theta}_i\| \neq 0$ is possible for the $M_i$ built from the LS estimate $\hat{K}$, while $\|M_i \hat{\theta}_i\| = 0$ holds when
$M_i$ is calculated with the $K_i$ which arise from the solution of the differential game corresponding to the
identified parameters $\hat{\theta}$. The latter lead to a Nash equilibrium according to the necessary and sufficient
conditions used for determining the trajectories $\hat{u}_i(t)$ and $\hat{x}(t)$.


Figure 8.7: Residual values of the identified control law and parameters for all subject pairs. Here, the outlier
$\|M_2 \hat{\theta}_2\| = 44.84$ for subject pair 22 is not depicted in favor of better visibility of the other values.


Another problem arises if the estimated $\hat{K}_i$ yields high values of the least-squares
estimation functional $\|u_i + \hat{K}_i x\|$ (cf. (5.36)), i.e. if the linear feedback is unable
to reproduce the relationship between $u_i(t)$ and $x(t)$. This would impair
the approximation capabilities of the inverse dynamic game methods based on IOC and IRL
since they rely on this feedback law estimation to include the influence of the other player's
controls on the system dynamics.
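
The least-squares estimation of a feedback law from measured data can be illustrated with a minimal sketch. It assumes a two-dimensional state and a scalar control, which is a simplification of the setting above (the extended state of the tracking problem also contains the reference), and it is not the implementation behind (5.36). The names `estimate_feedback_gain` and `ls_residual` and the gain values are hypothetical.

```python
import math


def estimate_feedback_gain(states, controls):
    """Least-squares fit of K = [k1, k2] minimizing the sum of (u + K x)^2."""
    s11 = sum(x[0] * x[0] for x in states)
    s12 = sum(x[0] * x[1] for x in states)
    s22 = sum(x[1] * x[1] for x in states)
    b1 = -sum(x[0] * u for x, u in zip(states, controls))
    b2 = -sum(x[1] * u for x, u in zip(states, controls))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("state data are not sufficiently exciting")
    # Solve the 2x2 normal equations with Cramer's rule.
    k1 = (b1 * s22 - b2 * s12) / det
    k2 = (s11 * b2 - s12 * b1) / det
    return [k1, k2]


def ls_residual(states, controls, gain):
    """Value of the functional ||u + K x|| for the given data and gain."""
    return math.sqrt(sum((u + gain[0] * x[0] + gain[1] * x[1]) ** 2
                         for x, u in zip(states, controls)))


# Noise-free data generated by u = -K x with a hypothetical gain.
k_true = [0.5, 2.0]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
controls = [-(k_true[0] * x[0] + k_true[1] * x[1]) for x in states]
k_hat = estimate_feedback_gain(states, controls)
print(k_hat, ls_residual(states, controls, k_hat))
```

For data that are exactly generated by a linear feedback law, the residual vanishes; a large residual on measured data indicates that no linear feedback reproduces the observed relationship between controls and states.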


Finally, the mean computation times presented in Table 8.3 show that the IOC method would
be the most appropriate method for a potential online application in which cost function
parameters are constantly updated as new data points become available. The IRL approach
may also serve such a purpose with more efficient coding. On the other hand, the
computation time of the DB approach confirms that it is not suitable for an online application.
Cost function parameters may change over time due to different effects, e.g. fatigue or even
sudden events. These alterations cannot be quickly detected by the DB method, but rather
by the alternative methods developed in this thesis.


8.9 Concluding Remarks 


In this chapter, an application example for inverse dynamic games was presented. A
cooperative steering experiment was conducted where pairs of subjects interacted haptically to
cooperatively complete a control task. The results indicate that it is possible to describe cooperative
system behavior by means of dynamic games, and that inverse dynamic game methods can
be used to identify cost functions which explain the observed behavior.


The results showed the following insights: 


• All methods are influenced by dynamic system model inaccuracy, irrational behavior
with respect to the control task, and the violation of the assumption of Nash equilibrium
trajectories. The IOC and IRL methods are the most affected by this violation.


• The IOC method is confirmed as the most promising method for the online estimation
of cost function parameters in real applications due to computation times of fractions
of a second.


• The IRL method performs better than the IOC method, but the estimation demands
more computation time. It is still less computationally demanding than the DB method
but has a lower performance.


• The DB method is the most robust towards all kinds of perturbations, but at the cost
of a high computational burden. In the evaluations conducted in this chapter, its
computation time was over 60 times and 7000 times higher than that of the
IRL and IOC methods, respectively.


The system used for the experiment and its dynamic model proved too inaccurate
to draw reliable conclusions concerning the cooperative behavior of humans in haptic interaction.
The results of this experiment suggest that the assumption of a Nash equilibrium in
haptic interaction may be reasonable in certain situations. In order to answer these
questions, which are also of interest to other scientific communities, more studies and
experiments have to be conducted. Nevertheless, the methods presented in this thesis showed
their potential for these purposes.


9 Conclusion 


As technical systems become more intelligent, they are also required to be able to interact 
with other technical systems and humans. The theory of dynamic games provides a useful 
mathematical framework for describing the interaction between several players with possibly 
conflicting interests. A large body of work exists concerning the calculation of the outcome of 
the dynamic game from known objectives of all players. On the contrary, the inverse problem 
of dynamic games, which consists in finding the cost functions each player minimized which 
led to the observed behavior, has received limited attention. This thesis contributes to this line 
of research by developing methods for the solution of N-player inverse dynamic games with 
both open-loop and feedback structures and with two different classes of methods, assuming 
that the interaction between players led to an open-loop or a feedback Nash equilibrium. 
Following the line of a large number of studies on the identification of cost functions in the
single-player case, the structure of the cost functions is fixed by assuming a linear combination of
basis functions such that the problem is reduced to finding cost function parameters for each
player. In addition, the results provide substantial insight into the properties of inverse optimal
control and inverse dynamic game problems.


The first method class proposed in this thesis is given by a residual-based IOC method and
exploits necessary and sufficient conditions for Nash equilibria which are based on
control-theoretical techniques. In the open-loop case, the reformulations of these conditions allow
posing the problem of identifying cost function parameters as an unconstrained quadratic
program. Furthermore, sufficient conditions are given to test for the uniqueness of the cost
function parameters up to a multiplicative constant. For a feedback structure, the use of the
same techniques is possible. Nevertheless, the knowledge of the feedback law becomes
necessary. Identifying the feedback law is feasible for the main class of dynamic games given by
infinite-horizon linear-quadratic dynamic games. Therefore, the inverse problem of dynamic
games was thoroughly analyzed for this particular class of games.
By exploiting the necessary and sufficient conditions for Nash equilibria given by algebraic 
Riccati equations, explicit solution sets describing all possible cost function parameters which 
correspond to the same Nash equilibrium were established. Furthermore, a quadratic pro- 
gram was formulated to efficiently find a solution of the inverse dynamic game. An analysis 
of the properties of this quadratic program yields necessary and sufficient conditions for the 
uniqueness of the inverse LQ dynamic game solutions. 


The second method class which was proposed is an IRL approach, where a probability density
function is stated as a likelihood function which depends on the cost function parameters of
each player. The likelihood function, found by means of the principle of Maximum Entropy,
implicitly includes the requirement that the expected costs of the trajectories sampled from
a density function with the estimated parameters correspond to the costs of the observed 
trajectories. The cost function parameters are determined via a Maximum-Likelihood es- 
timation. For this approach, it was proved that by maximizing the likelihood function we 
obtain equal expected costs of trajectories generated by the probability density function with 
ground truth parameters and the one with the estimated parameters. 


Having proposed two major classes of inverse dynamic game methods for each of the two 
information structures considered, i.e. open-loop and feedback, a systematic evaluation was 
conducted where each method was tested using Nash equilibrium trajectories of a test
system. Until now, such a study was missing in the literature, even for the single-player case. For
inverse dynamic games with open-loop strategies, a two-player game with a nonlinear ball- 
on-beam dynamic system was considered. The evaluation in the case of a feedback Nash 
equilibrium was done using a three-player linear-quadratic dynamic game. Both cases in- 
cluded a comparison of the performance of IOC and IRL based methods as well as a direct 
bilevel (DB) approach analogous to the widespread state-of-the-art single-player inverse dy- 
namic game method of Mombaur et al. [MTL10]. The main findings confirm previous evi- 
dence that bilevel methods generally need a high computational effort, since they demand the 
solution of several dynamic games, i.e. determining Nash equilibria from current candidate 
cost function parameters. The IOC method outperformed IRL and the DB method in the case 
of perfect measurements. However, it was shown that the DB and IRL methods are similar to 
each other and more robust towards measurement noise than IOC methods, since the results 
of the latter deteriorate with higher measurement noise. Nevertheless, if the measurement
noise is low, IOC methods can yield even better results than the DB approach, while
needing only between 0.005% and 0.01% of the DB method’s computation
time. In addition, the inverse dynamic game methods which exploit the estimation of
the feedback Nash equilibrium control laws were shown to be more robust towards measure- 
ment noise. As for potential errors in the basis functions, the IRL method offers the ability of 
detecting irrelevant basis functions with less computational effort than the DB method. The 
IOC methods show a higher dependency on meaningful basis functions. 


Finally, an application example of cooperative system identification was presented, where 
the aim was to identify cost functions which explain cooperative behavior of humans while 
completing together a control task and interacting haptically in the process. The results 
confirmed the trends observed in the simulations, showing that the DB method is the most 
robust method, followed by IRL and IOC methods. Nevertheless, some data sets could not 
be described properly by any of the methods. The results indicate that an accurate dynamic 
system model is of utmost importance for the use of these methods. With a model which 
better describes the dynamic system both humans interact through, it is conceivable that 
the developed methods based on IRL and IOC yield a good performance with a reasonable
required computational time (of seconds or even milliseconds), thus allowing for their use in
real applications where an online estimation of cost function parameters is of interest. 


To summarize, this thesis makes a contribution to the theory of inverse problems in optimal 
control and dynamic game theory. The results not only provide new methods for solving 
this class of problems, but also shed new light onto their properties. In particular, the novel 
necessary and sufficient conditions for unique solutions of inverse dynamic games, as well
as for the unbiasedness of the estimation in an IRL setting, are also valid for the single-player
case. The methods open new possibilities for applications regarding the description of
multi-agent or cooperative system behavior, e.g. for the identification of human behavior during
the interaction with a machine or of biological systems in general, leading to the possibility 
of employing a learning-by-demonstration approach in a multi-agent setting. 


A Infinite Dynamic Games in Discrete Time 


This section gives an overview of the relevant definitions and theorems for discrete-time 
dynamic games which are considered in Chapter 6 of this thesis. The definitions and theorems 
are analogous to the ones in continuous time. Therefore, each of them has a corresponding 
counterpart which can be found in Chapter 3. The following selection is based on the books 
[BO99, HKZ12]. 


A.1 Basic Definitions 


A discrete-time dynamic game involves N players taking actions in several discrete time
steps. Since their possible actions are infinite, typical description forms such as payoff matrices
or game trees are not possible (see e.g. [BO99, Chapter 3]). Instead, the evolution of their
decision process is described by means of a dynamic system in discrete time which is defined
as follows.


Definition A.1 (Dynamic System in Discrete-Time State Space Representation)


A dynamic system is defined by a difference equation and an initial condition given by

$$x^{(k+1)} = f_D(x^{(k)}, u_1^{(k)}, \dots, u_N^{(k)}) \tag{A.1a}$$
$$x^{(1)} = x_1, \tag{A.1b}$$

where $x^{(k)} \in \mathbb{R}^n$ and $u_i^{(k)} \in \mathbb{R}^{m_i}$ denote the system state vector and the control vector of
player $i$ at time step $k \in \{1, 2, \dots, k_E\} =: \mathcal{K}$, respectively.
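A rollout of the difference equation in Definition A.1 can be sketched as follows. The scalar two-player dynamics `f_example` and the control values are hypothetical and serve only to show how the state sequence is generated from the initial state and the players' control sequences.

```python
def simulate(f_d, x1, controls):
    """Roll out x^(k+1) = f_D(x^(k), u_1^(k), ..., u_N^(k)) starting at x^(1) = x1.

    controls is a list with one tuple (u_1, ..., u_N) per stage.
    Returns the list [x^(1), x^(2), ..., x^(k_E + 1)].
    """
    xs = [x1]
    for u_k in controls:
        xs.append(f_d(xs[-1], *u_k))
    return xs

def f_example(x, u1, u2):
    # Hypothetical scalar dynamics acted upon by two players.
    return 0.9 * x + 0.5 * u1 + 0.2 * u2

traj = simulate(f_example, 1.0, [(0.0, 0.0), (1.0, -1.0), (0.0, 0.0)])
```

Each player only contributes its own entry of the control tuple, while the state evolution depends on all of them, which is the coupling that the game-theoretic solution concepts have to account for.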


Each player $i \in \mathcal{P}$ acts upon the system in Definition A.1 by applying a sequence of inputs
or controls $u_i^{(k)}$, $\forall k \in \mathcal{K}$, which belongs to an (here infinite) action space $U_i$. Analogously to
the continuous-time case, each player decides on a particular strategy $\gamma_i$ from the space $\Gamma_i$.
The control decision is based on the information available to them which is represented by a
set-valued function $\eta_i^{(k)}$. This function is generally defined for each player $i \in \mathcal{P}$ and all time
steps $k \in \mathcal{K}$ as a subset of

Tad U E uO ee. a (A.2) 


j=1,...,k

where $y_i^{(k)} = h_i^{(k)}(x^{(k)})$ denotes the observed values of the state $x^{(k)}$ according to a function
$h_i^{(k)}$. Consequently, the control value at step $k$ results from $\gamma_i(\eta_i^{(k)}) = u_i^{(k)}$, $\gamma_i \in \Gamma_i$.


Each player selects its strategy according to an individual stage-additive cost function of the
form

$$J_i = \sum_{k=1}^{k_E} g_{D,i}(x^{(k)}, u_i^{(k)}, u_{\neg i}^{(k)}). \tag{A.3}$$


To summarize, a definition of the discrete-time infinite dynamic game is given.


Definition A.2 (Non-Cooperative Discrete-Time Dynamic Game)


A non-cooperative discrete-time dynamic game is defined by
• A set of players $\mathcal{P} = \{1, \dots, N\}$
• A set $\mathcal{K} = \{1, \dots, k_E\}$ including the stages of the game
• An infinite action set $U_i$, $i \in \mathcal{P}$
• A set-valued function $\eta_i^{(k)}$ describing the state information of player $i \in \mathcal{P}$ at time
step $k$
• A system given by Definition A.1
• A set of stage-additive cost functions $J = \{J_1, \dots, J_N\}$, $i \in \mathcal{P}$.


The elements and the definition strongly resemble those introduced in Chapter 3. In fact, in 
system-theoretical terms, if a time difference between each level of play (e.g. k and k + 1) in 
a discrete-time dynamic game can be stated and this difference tends towards zero, the game 
may be considered an approximation of a corresponding continuous-time differential game 
(quasi-continuous analysis). Indeed, this fact was exploited in order to apply the IRL-based 
inverse dynamic game methods of Chapter 6 to continuous-time models, e.g. the physically 
interpretable model of the ball-on-beam system. Furthermore, this allows the comparison of 
the methods presented in this thesis. 
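The quasi-continuous argument can be illustrated numerically. The sketch below assumes a forward-Euler discretization and a hypothetical scalar continuous-time model; the thesis does not state which discretization was used, so this is only one possible choice. As the time difference between stages shrinks, the discrete-time rollout approaches the continuous-time solution.

```python
import math

def euler_discretize(f_c, dt):
    """Return f_D with x^(k+1) = x^(k) + dt * f_c(x^(k), u_1, u_2) (forward Euler)."""
    def f_d(x, u1, u2):
        return x + dt * f_c(x, u1, u2)
    return f_d

def f_c(x, u1, u2):
    # Hypothetical scalar continuous-time dynamics dx/dt = -x + u1 + u2.
    return -x + u1 + u2

def final_state(dt, t_end=1.0, x0=1.0):
    """State at t_end obtained from the discrete-time model with zero controls."""
    f_d = euler_discretize(f_c, dt)
    x = x0
    for _ in range(round(t_end / dt)):
        x = f_d(x, 0.0, 0.0)
    return x

# With zero controls the exact continuous-time solution at t = 1 is exp(-1).
errors = [abs(final_state(dt) - math.exp(-1.0)) for dt in (0.1, 0.01, 0.001)]
```

The entries of `errors` shrink roughly in proportion to the step size, which is the first-order behavior expected from the Euler scheme.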


A.2 Information Structures 


In the following, a definition of the information structures analogous to the ones in Definition
3.4 is given.


Definition A.3 (Information Structure of the Players in Discrete-Time Dynamic
Games)


The information structure of player $i$ is said to be
(i) open-loop (OL) pattern if $\eta_i^{(k)} = \{x^{(1)}\}$, $k \in \mathcal{K}$.
(ii) memoryless perfect state (MPS) pattern if $\eta_i^{(k)} = \{x^{(1)}, x^{(k)}\}$, $k \in \mathcal{K}$.
(iii) feedback (FB) pattern if $\eta_i^{(k)} = \{x^{(k)}\}$, $k \in \mathcal{K}$.


A.3 Strategies 


Similar to Section 3.4, the following definitions describe open-loop and feedback strategies 
in discrete-time dynamic games. 


Definition A.4 (Open-Loop Strategy in Discrete-Time Dynamic Games)
An open-loop strategy $\gamma_i^{OL}$ for player $i \in \mathcal{P}$ selects a control action according to

$$u_i^{(k)} = \gamma_i^{OL}(x^{(1)}), \quad \forall x^{(1)} \in \mathbb{R}^n,\ k \in \mathcal{K}. \tag{A.4}$$

The set of all such possible strategies is denoted by $\Gamma_i^{OL}$.


Definition A.5 (Feedback Strategy in Discrete-Time Dynamic Games)
A feedback strategy $\gamma_i^{FB}$ for player $i \in \mathcal{P}$ selects a control action according to

$$u_i^{(k)} = \gamma_i^{FB}(x^{(k)}), \quad k \in \mathcal{K}. \tag{A.5}$$

The set of all such possible strategies is denoted by $\Gamma_i^{FB}$.


A.4 Conditions for Nash Equilibria and Pareto Efficient 
Solutions in Discrete-Time Dynamic Games 


The definitions of the solution concepts, i.e. Nash equilibrium, Stackelberg and Pareto efficient
solutions, are identical to the ones given in Section 3.5. The only difference is the definition
of the strategies $\gamma_i$, which are defined for discrete-time dynamic games by Definitions A.4
and A.5. Therefore, the definitions are not rewritten here. Nevertheless, in the following,
analogous results to Theorems 3.1 - 3.3 are given. These serve as a basis for the calculation
of solutions of discrete-time dynamic games.


Nash Equilibrium 


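The following theorems are based on the discrete-time Hamiltonian function

$$H_{D,i}^{(k)}(\psi_{D,i}^{(k+1)}, x^{(k)}, u_i^{(k)}, u_{\neg i}^{(k)}) := g_{D,i}(x^{(k)}, u_i^{(k)}, u_{\neg i}^{(k)}) + \psi_{D,i}^{(k+1)\top} f_D(x^{(k)}, u_i^{(k)}, u_{\neg i}^{(k)}), \quad k \in \mathcal{K},\ i \in \mathcal{P}. \tag{A.6}$$

Furthermore, the shorthand notations

$$f_D^{(k)*} := f_D(x^{(k)*}, u_i^{(k)*}, u_{\neg i}^{(k)*}) \tag{A.7}$$
$$g_{D,i}^{(k)*} := g_{D,i}(x^{(k)*}, u_i^{(k)*}, u_{\neg i}^{(k)*}) \tag{A.8}$$

are introduced.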


The following theorem is the discrete-time counterpart of Theorem 3.1. 


Theorem A.1 (Necessary Conditions for Open-Loop Nash Equilibria in Discrete-Time
Dynamic Games)

For an N-player discrete-time infinite dynamic game, let $f_D(x^{(k)}, u_i^{(k)}, u_{\neg i}^{(k)})$ be convex and
$g_{D,i}(x^{(k)}, u_i^{(k)}, u_{\neg i}^{(k)})$ be continuously differentiable on $\mathbb{R}^n$ for all $k \in \mathcal{K}$, $i \in \mathcal{P}$.
Then, if $(\gamma_1^*(x_1), \dots, \gamma_N^*(x_1))$ with $\gamma_i^*(x_1) = u_i^*$ provides an open-loop Nash equilibrium
solution with $x^*$ as the corresponding state trajectory, there exists a finite sequence of costate
functions $(\psi_{D,i}^{(1)}, \dots, \psi_{D,i}^{(k_E)})$, $i \in \mathcal{P}$, such that the following relations are satisfied:

$$x^{(k+1)*} = f_D^{(k)*}, \quad x^{(1)*} = x_1 \tag{A.9a}$$
$$u_i^{(k)*} = \arg\min_{u_i^{(k)}} H_{D,i}^{(k)}(\psi_{D,i}^{(k+1)}, x^{(k)*}, u_i^{(k)}, u_{\neg i}^{(k)*}) \tag{A.9b}$$
$$\psi_{D,i}^{(k)} = \nabla_{x^{(k)}} H_{D,i}^{(k)}(\psi_{D,i}^{(k+1)}, x^{(k)*}, u_i^{(k)*}, u_{\neg i}^{(k)*}) \tag{A.9c}$$
$$\psi_{D,i}^{(k_E+1)} = 0, \tag{A.9d}$$

where $\nabla_{x^{(k)}}$ denotes a partial derivative with respect to the states $x^{(k)}$.


Proof:
See e.g. the proof of Theorem 6.1 of [BO99].


Before presenting the theorem which represents necessary and sufficient conditions for feed- 
back Nash equilibria, the discrete-time value function is defined. 


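Definition A.6 (Value Function)


Consider a player $i \in \mathcal{P}$. Let the optimal strategies of the other players $\gamma_{\neg i}^*$ associated to an
N-player non-cooperative discrete-time infinite dynamic game be given. The value function
$V_i: \mathbb{R}^n \times \mathcal{K} \to \mathbb{R}$ of player $i$ is defined by

$$V_i(x, k) = \min_{\gamma_i} \sum_{\kappa=k}^{k_E} g_{D,i}(x^{(\kappa)}, \gamma_i(x^{(\kappa)}), \gamma_{\neg i}^*(x^{(\kappa)})), \tag{A.10}$$

where $x^{(k)} = x$.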


The following theorem is the discrete-time counterpart of Theorem 3.2. 


Theorem A.2 (Necessary and Sufficient Conditions for Feedback Nash Equilibria
in Discrete-Time Dynamic Games)

For an N-player discrete-time dynamic game, an N-tuple of feedback strategies
$(\gamma_1^*, \dots, \gamma_N^*)$ provides a feedback Nash equilibrium (FNE) solution if, and only if, there
exist value functions $V_i$ according to Definition A.6 such that the following recursive
relations are satisfied for all players $i \in \mathcal{P}$:

$$\begin{aligned} V_i(x, k) &= \min_{u_i^{(k)}} \left[ \tilde{g}_{D,i}^{(k)*}(x, u_i^{(k)}) + V_i(\tilde{f}_D^{(k)*}(x, u_i^{(k)}), k+1) \right] \\ &= \tilde{g}_{D,i}^{(k)*}(x, u_i^{(k)*}) + V_i(\tilde{f}_D^{(k)*}(x, u_i^{(k)*}), k+1); \quad V_i(x, k_E + 1) = 0, \end{aligned} \tag{A.11}$$

where

$$\tilde{f}_D^{(k)*}(x, u_i^{(k)}) = f_D(x, u_i^{(k)}, \gamma_{\neg i}^{(k)*}(x)),$$
$$\tilde{g}_{D,i}^{(k)*}(x, u_i^{(k)}) = g_{D,i}(x, u_i^{(k)}, \gamma_{\neg i}^{(k)*}(x)).$$

The corresponding Nash equilibrium cost for player $i$ is $V_i(x_1, 1)$.


Proof:
See the proof of Theorem 6.6 of [BO99].


Theorem A.2 gives not only sufficient conditions for FNE (cf. Theorem 3.2), but also necessary
conditions. Its core consists of the N Bellman equations (A.11) which, analogous to the
single-player case, follow from the principle of optimality stated by Bellman [Bel66].⁵⁹ For dynamic
games, the Bellman equations imply that the N inequalities corresponding to the definition
of the Nash equilibrium must hold true for all possible local games (with $\gamma_i \in \Gamma_i^{FB}$) defined
at each possible initial point $x^{(k)}$, $k \in \mathcal{K}$, thus leading to the strong time consistency property
of the FNE.
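For a linear-quadratic special case, the backward recursion behind the Bellman equations can be sketched in a few lines. The example assumes a scalar state, two players with scalar controls, quadratic value functions $V_i(x, k) = \frac{1}{2} p_i x^2$ and linear strategies $u_i = -k_i x$; the function names and numerical values are hypothetical, and the code is an illustration of the principle rather than the implementation used in this thesis.

```python
def feedback_nash_lq(a, b, q, r, k_end):
    """Backward recursion for x+ = a x + b1 u1 + b2 u2 with stage costs
    0.5 * (q_i x^2 + r[i][0] u1^2 + r[i][1] u2^2).

    Returns gains[k] = (k1, k2) for stages k = 0, ..., k_end - 1 (u_i = -k_i x).
    """
    p = [0.0, 0.0]                       # terminal condition V_i(x, k_E + 1) = 0
    gains = []
    for _ in range(k_end):
        # Stationarity of both players' stage problems: 2x2 linear system in (k1, k2).
        m11 = r[0][0] + b[0] * p[0] * b[0]
        m12 = b[0] * p[0] * b[1]
        m21 = b[1] * p[1] * b[0]
        m22 = r[1][1] + b[1] * p[1] * b[1]
        c1, c2 = b[0] * p[0] * a, b[1] * p[1] * a
        det = m11 * m22 - m12 * m21
        k1 = (c1 * m22 - m12 * c2) / det
        k2 = (m11 * c2 - m21 * c1) / det
        f_cl = a - b[0] * k1 - b[1] * k2          # closed-loop dynamics
        p = [q[i] + r[i][0] * k1 ** 2 + r[i][1] * k2 ** 2 + p[i] * f_cl ** 2
             for i in range(2)]
        gains.append((k1, k2))
    gains.reverse()                               # stage order k = 0, ..., k_end - 1
    return gains

def cost(i, gains, a, b, q, r, x1):
    """Cost of player i when both players apply the given feedback gains."""
    x, total = x1, 0.0
    for k1, k2 in gains:
        u1, u2 = -k1 * x, -k2 * x
        total += 0.5 * (q[i] * x ** 2 + r[i][0] * u1 ** 2 + r[i][1] * u2 ** 2)
        x = a * x + b[0] * u1 + b[1] * u2
    return total

a, b, q = 1.1, (1.0, 0.5), (1.0, 2.0)
r = [[1.0, 0.0], [0.0, 1.0]]
gains = feedback_nash_lq(a, b, q, r, 10)
```

The Nash property can be checked numerically: changing one player's gain at any single stage, while the other player's strategy stays fixed, does not reduce that player's cost computed with `cost`.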


Pareto Efficient Solutions 


The following theorem presents necessary and sufficient conditions for Pareto efficient solu- 
tions in discrete-time dynamic games. It constitutes the counterpart of Theorem 3.3. 


Theorem A.3 (Necessary and Sufficient Conditions for Pareto Efficient Solutions
in Discrete-Time Dynamic Games)
Let $\tau_i > 0$, for all $i \in \mathcal{P}$, satisfy

$$\sum_{i=1}^{N} \tau_i = 1. \tag{A.13}$$

Now consider an N-player discrete-time dynamic game. If $\gamma^P = \{\gamma_1^P, \dots, \gamma_N^P\}$ is such that

$$\gamma^P = \arg\min_{\gamma} \sum_{i=1}^{N} \tau_i J_i(\gamma) \tag{A.14a}$$

subject to

$$x^{(k+1)} = f_D(x^{(k)}, u_i^{(k)}, u_{\neg i}^{(k)}) \tag{A.14b}$$
$$x^{(1)} = x_1, \tag{A.14c}$$

then $\gamma^P$ is a Pareto efficient solution (PES). Moreover, if the strategy spaces $\Gamma_i$ are convex
and $J_i$ are convex in $u_i^{(k)}$ for all $i \in \mathcal{P}$, $k \in \mathcal{K}$, then for all Pareto efficient $\gamma^P$ there exist $\tau_i$
such that $\gamma^P$ solves the optimization problem in (A.14).


Proof:

The theorem is stated analogously to Theorem 3.3. According to [LZ18], both the sufficiency
part (first theorem assertion) and the necessity part, which are taken from the continuous-time
result, are valid for the discrete-time case. □


59 The principle of optimality as stated in [Bel66] reads: "An optimal policy has the property that whatever the
initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to
the state resulting from the first decision". This result was used to derive the Bellman equation in single-player
optimal control (see e.g. [Kir04, Chapter 3]).


Similar to Theorem 3.3, the optimization problem (A.14) allows the use of the discrete-time
minimum principle to solve for the PES. Further results concerning the necessary and
sufficient conditions, in terms of the minimum principle corresponding to the problem defined
by (A.14), are presented in [LZ18].


A.5 Discrete-Time Linear-Quadratic Dynamic Games 


Analogously to LQ differential games, discrete-time LQ dynamic games are defined as fol- 
lows. 


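Definition A.7 (Linear-Quadratic Dynamic Game)


A linear-quadratic dynamic game is defined by the same elements as Definition A.2. The
system dynamics are linear, i.e. are defined by

$$x^{(k+1)} = A_D x^{(k)} + \sum_{j=1}^{N} B_{D,j} u_j^{(k)}, \tag{A.15}$$

where $x^{(k)} \in \mathbb{R}^n$, $u_j^{(k)} \in \mathbb{R}^{m_j}$. The cost functions are quadratic, i.e.

$$J_i = \frac{1}{2} \sum_{k=1}^{k_E} \Big( x^{(k)\top} Q_i x^{(k)} + \sum_{j=1}^{N} u_j^{(k)\top} R_{ij} u_j^{(k)} \Big), \tag{A.16}$$

where $Q_i$, $R_{ij}$ are symmetric for all $i, j \in \mathcal{P}$ and $R_{ii} \succ 0$.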


The positive semidefiniteness of $Q_i$ and $R_{ij}$, $i, j \in \mathcal{P}$, $i \neq j$, is sometimes required in order
to state necessary and sufficient conditions for Nash equilibria in open-loop and feedback
information structures by means of discrete-time coupled Riccati equations. These equations
are derived from the discrete-time minimum principle, i.e. Theorem A.1, and the coupled
Bellman equations, i.e. Theorem A.2, respectively. In this thesis, a quasi-continuous analysis
was considered such that the trajectories of states and controls in LQ dynamic games were
generated by the continuous-time RDEs. Therefore, the discrete-time Riccati equations are
not explicitly given here. The reader is referred to


• [BO99, Theorem 6.2] for discrete-time Riccati equations in LQ open-loop dynamic
games


• [BO99, Corollary 6.1] for discrete-time Riccati equations in LQ feedback dynamic games


• [BO99, Proposition 6.3] for discrete-time Riccati equations in infinite-horizon LQ
feedback dynamic games.


B Mathematical Supplements 


In this section, further mathematical details are given which complement various sections of 
this thesis. 


B.1 Proof of Theorem 3.4 


To the best of the author’s knowledge, the precise formulation of Theorem 3.4 is not available 
in literature. Similar results can be found in [BO99, Theorem 6.12]. However, a formulation 
similar to the results in [Eng05] was chosen in this thesis in favor of simplicity. 
Proof: 
[Eng05, Theorem 7.2] states that an OLNE exists if the coupled RDEs (3.60) with conditions 
(3.61) have a solution P;, i € P and additionally, a symmetric solution P;(t) to the non- 
coupled RDE 

Pitt) = —A'P;(t) - P)(t)A + P,(t)S;Pi(t) - Q,(T) (B.1) 


exists on $[0, T]$ for all players $i \in \mathcal{P}$. Under the theorem conditions $Q_i \succeq 0$ and $Q_{i,T} \succeq 0$,
$i \in \mathcal{P}$, results of the theory of differential equations can be leveraged to state that the
solutions $P_i(t)$ of (B.1) are guaranteed to exist (cf. proof of [BO99, Proposition 5.3]). The theorem
assertion follows.


B.2 Equivalence of Cost Functions 


Inverse optimal control and inverse dynamic game problems have an inherent ill-posedness 
property. We give in this section definitions of the equivalence of cost functions in an optimal 
control and dynamic game scenario. 


B.2.1 Optimal Control 


In an optimal control problem, where optimal control trajectories $u^*(t)$ which minimize a
cost function $J$ are sought, more than one cost function exists which would lead to the same
optimal control $u^*(t)$. Consequently, if the system dynamics are unchanged, they lead to the
same state trajectories $x^*(t)$. Mathematically, this means that even if

$$J^{(1)}(u(t)) \neq J^{(2)}(u(t)), \tag{B.2}$$

it is still possible to obtain

$$\arg\min_{u(t)} J^{(1)}(u(t)) = \arg\min_{u(t)} J^{(2)}(u(t)). \tag{B.3}$$

For example, it is a well-known fact that (B.3) holds for $J^{(2)}(u(t)) = c J^{(1)}(u(t))$, $c \in \mathbb{R}^+$.
Nevertheless, according to [NF04], the ill-posedness of a general inverse LQ optimal control problem
may transcend the ill-posedness due to a positive real constant. Therefore, it is conceivable
that this property is still present in a general inverse (non-LQ) optimal control problem. To
define when two cost functions are equivalent, we introduce the following definition.


Definition B.1 (Equivalence of Cost Functions in an Optimal Control Problem)


Two cost functions $J^{(1)}$ and $J^{(2)}$ are equivalent if and only if

$$S^{(1)} = S^{(2)}, \tag{B.4}$$

where $S^{(j)}$, $j \in \{1, 2\}$, denotes the set of solutions for cost function $J^{(j)}$, i.e.

$$S^{(j)} = \left\{ u^*(t) \mid u^*(t) = \arg\min_{u(t)} J^{(j)}(u(t)) \right\}. \tag{B.5}$$
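The scaling example behind (B.3) can be checked numerically. The sketch below uses a scalar discrete-time LQR problem as a stand-in for the general optimal control problem; the function name `lqr_gain` and all numerical values are hypothetical. Multiplying both weights by a constant $c > 0$ leaves the optimal feedback gain, and hence the optimal control, unchanged, whereas changing only one weight does not.

```python
def lqr_gain(a, b, q, r, iters=500):
    """Stationary gain k (u = -k x) of the scalar discrete-time LQR problem
    x+ = a x + b u with stage cost q x^2 + r u^2, via Riccati iteration."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

k_ref = lqr_gain(1.2, 1.0, 2.0, 0.5)
k_scaled = lqr_gain(1.2, 1.0, 3.0 * 2.0, 3.0 * 0.5)   # J^(2) = 3 * J^(1)
k_other = lqr_gain(1.2, 1.0, 2.0, 1.5)                # only r changed
```

Here `k_ref` and `k_scaled` coincide up to rounding, so the two cost functions are equivalent in the sense of Definition B.1, while `k_other` differs.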


B.2.2 Differential Game 


An N-player differential game can be considered a generalization of an optimal control
problem. Consequently, the ill-posedness issues discussed in the last section are valid in this more
general case as well. Analogously to Definition B.1, it is possible to define two equivalent cost
functions of a specific player $i$ in a differential game with N players.


Definition B.2 (Equivalence of Cost Functions in a Differential Game)


Two cost functions $J_i^{(1)}$ and $J_i^{(2)}$ are equivalent if and only if

$$S_i^{(1)} = S_i^{(2)}, \tag{B.6}$$

where $S_i^{(j)}$, $j \in \{1, 2\}$, denotes the set of solutions of cost function $J_i^{(j)}$, i.e.

$$S_i^{(j)} = \left\{ u_i^*(t) \mid u_i^*(t) = \arg\min_{u_i(t)} J_i^{(j)}(u_i(t), u_{\neg i}^*(t)) \right\}. \tag{B.7}$$


This definition can be interpreted as follows. Let $J_{\neg i}$ represent the $N - 1$ cost functions except
the cost function of player $i$. If these cost functions are fixed, then according to Definition
B.2, two cost functions for player $i$ are equivalent if and only if, together with $J_{\neg i}$, they lead
to the same Nash equilibrium.


B.3 Calculation of Open-Loop Nash Equilibria With the 
Minimum Principle 


Section 3.6.1 presented Theorem 3.1 as necessary conditions for OLNE which consist of sev- 
eral coupled differential equations. Under certain restrictions, these can be used to state a 
two-point boundary value problem (TPBVP) to solve for Nash equilibrium state trajectories 
x*(t). The following lemma represents a useful result for this purpose. 


Lemma B.1
Consider an N-player differential game where the system dynamics are affine in the controls,
i.e.

$$\dot{x}(t) = f(x(t), u_1(t), \dots, u_N(t), t) = f_0(x, t) + \sum_{i=1}^{N} G_i(x, t) u_i(t) \tag{B.8}$$

and the running costs $g_i$ of the cost function $J_i$ in (3.3) are given by

$$g_i(x(t), u_1(t), \dots, u_N(t)) = g_{i,1}(x(t), u_1(t)) + \dots + g_{i,N}(x(t), u_N(t)), \quad \forall i \in \mathcal{P}. \tag{B.9}$$

Furthermore, assume that the functions $u_j \mapsto g_{i,j}(x(t), u_j(t))$ are strictly convex for all
$i, j \in \mathcal{P}$ and that $g_{i,i}$ has superlinear growth, i.e.

$$\lim_{\|u_i\| \to \infty} \frac{g_{i,i}(x, u_i)}{\|u_i\|} = +\infty. \tag{B.10}$$

Then, for every $(x, t) \in \mathbb{R}^n \times [0, T]$ and every tuple $(\psi_1, \dots, \psi_N) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n$, the
minimization problem

$$u_i^*(t) = \arg\min_{u_i(t)} \left\{ \psi_i^\top G_i(x, t) u_i(t) + g_{i,i}(x(t), u_i(t)) \right\} \tag{B.11}$$

has a unique solution.


Proof:
The proof is analogous to the proof of Lemma 4.1 in [Bre11] in a two-player case. □
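As a worked instance of Lemma B.1, assume quadratic own-control costs $g_{i,i}(x, u_i) = u_i^\top R_{ii} u_i$ with a symmetric $R_{ii} \succ 0$. This choice is an illustrative assumption and is not required by the lemma. The unique minimizer of (B.11) then follows from the stationarity condition:

```latex
\begin{align*}
0 &= \nabla_{u_i} \left( \psi_i^\top G_i(x,t)\, u_i + u_i^\top R_{ii}\, u_i \right)
   = G_i(x,t)^\top \psi_i + 2 R_{ii}\, u_i \\
\Rightarrow \quad u_i^* &= -\tfrac{1}{2}\, R_{ii}^{-1} G_i(x,t)^\top \psi_i .
\end{align*}
```

Strict convexity and superlinear growth hold because $R_{ii} \succ 0$. For a scalar control with $R_{ii} = \theta > 0$, the minimizer reduces to $u_i^* = -G_i(x,t)^\top \psi_i / (2\theta)$, which is the structure of the control law (B.19) in Section B.4.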


The implications of Lemma B.1 are explained in the following. By using the n algebraic
equations defined by (3.17) and the results of Lemma B.1, $u_i^*(t)$ can be written as the unique
map

$$u_i^*(t) = \eta_i^*(x(t), \psi_i(t), t). \tag{B.12}$$

By inserting (B.12) in (3.16a) and (3.16c), we obtain a system of coupled non-linear differential
equations consisting of

$$\dot{x}(t) = f(x(t), \eta_i^*(t), \eta_{\neg i}^*(t), t) \tag{B.13}$$
$$\dot{\psi}_i(t) = -\nabla_x H_i(\psi_i(t), x(t), \eta_i^*(t), \eta_{\neg i}^*(t), t), \tag{B.14}$$

where $\eta_i^*(t)$ and $\eta_{\neg i}^*(t)$ are used as a short notation for $\eta_i^*(x(t), \psi_i(t))$ and $\eta_{\neg i}^*(x(t), \psi_{\neg i}(t))$,
respectively, and the boundary conditions

$$x^*(0) = x_0 \tag{B.15a}$$
$$\psi_i(T) = \nabla_x h_i(x(T)). \tag{B.15b}$$

The TPBVP arising from (B.13), the differential equations (B.14) for each $i \in \mathcal{P}$ and boundary
conditions (B.15) can be solved using numerical methods, e.g. shooting methods or
collocation methods [AMR95, Chapter 4]. The solution of this TPBVP describes an OLNE.
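A simple shooting method for such a TPBVP can be sketched on a deliberately small problem. The example assumes a single player ($N = 1$), scalar dynamics $\dot{x} = a x + b u$, running cost $q x^2 + r u^2$ and no terminal cost, so that the terminal condition is $\psi(T) = 0$; all names and values are hypothetical. The unknown initial costate $\psi(0)$ is adjusted by a secant iteration until the terminal condition is met.

```python
def rhs(x, psi, a, b, q, r):
    """State and costate derivatives with the minimizing control inserted."""
    u = -b * psi / (2.0 * r)                 # minimizer of the Hamiltonian
    return a * x + b * u, -2.0 * q * x - a * psi

def integrate(psi0, x0, t_end, steps, a, b, q, r):
    """Classical RK4 integration of (x, psi) from t = 0 to t = t_end."""
    h = t_end / steps
    x, psi = x0, psi0
    for _ in range(steps):
        k1 = rhs(x, psi, a, b, q, r)
        k2 = rhs(x + 0.5 * h * k1[0], psi + 0.5 * h * k1[1], a, b, q, r)
        k3 = rhs(x + 0.5 * h * k2[0], psi + 0.5 * h * k2[1], a, b, q, r)
        k4 = rhs(x + h * k3[0], psi + h * k3[1], a, b, q, r)
        x += h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        psi += h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return x, psi

def shoot(x0, t_end, steps, a, b, q, r, tol=1e-10, max_iter=20):
    """Secant iteration on psi(0) until the terminal condition psi(T) = 0 holds."""
    g0, g1 = 0.0, 1.0
    r0 = integrate(g0, x0, t_end, steps, a, b, q, r)[1]
    r1 = integrate(g1, x0, t_end, steps, a, b, q, r)[1]
    for _ in range(max_iter):
        if abs(r1) < tol or r1 == r0:
            break
        g0, g1, r0 = g1, g1 - r1 * (g1 - g0) / (r1 - r0), r1
        r1 = integrate(g1, x0, t_end, steps, a, b, q, r)[1]
    return g1

psi0 = shoot(1.0, 1.0, 200, 0.5, 1.0, 1.0, 1.0)
x_T, psi_T = integrate(psi0, 1.0, 1.0, 200, 0.5, 1.0, 1.0, 1.0)
```

Because this toy problem is linear, the secant iteration converges after one update. For nonlinear games with several players, the unknown is the stacked vector of all initial costates, and collocation methods are often more robust than simple shooting.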


B.4 Open-Loop Nash Equilibrium of the Ball-on-Beam System


In this section, details on the computation of the OLNE for the differential game with the ball- 


on-beam system considered in Section 7.4 are provided. In the following, time dependencies 


shall be omitted for brevity. Furthermore, all equations with the index i refer to player i € 


{1, 2}. The ball-on-beam system dynamics are given by 


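$$\dot{x} = f(x, u_1, u_2) = \begin{bmatrix} x_2 \\ \dfrac{m_b r_b^2 \left( x_1 x_4^2 - g_e \sin(x_3) \right)}{\Theta_b + m_b r_b^2} \\ x_4 \\ \dfrac{-2 m_b x_1 x_2 x_4 - m_b g_e x_1 \cos(x_3) + u_1 + u_2}{m_b x_1^2 + \Theta_w} \end{bmatrix} \tag{B.16}$$

and the cost functions are defined as

$$J_i = \int_0^T \theta_i^\top \phi_i \, \mathrm{d}t \tag{B.17}$$

with the parameter vector $\theta_i \in \mathbb{R}^{5 \times 1}$ and the basis function vector

$$\phi_i = \begin{bmatrix} x_1^2 & x_2^2 & x_3^2 & x_4^2 & u_i^2 \end{bmatrix}^\top. \tag{B.18}$$

The corresponding Hamiltonian is

$$H_i = \theta_i^\top \phi_i + \psi_i^\top f(x, u_i, u_{\neg i}).$$

Using (3.17), we obtain for each player's controls

$$u_i^* = \eta_i(x, \psi_i) = -\frac{\psi_{i,(4)}}{2 \theta_{i,(5)} \left( m_b x_1^2 + \Theta_w \right)}. \tag{B.19}$$

Next, we apply (3.16c) to obtain

$$\dot{\psi}_i = -\begin{bmatrix} 2 \theta_{i,(1)} x_1 + \psi_{i,(2)} (\nabla_x f)_{(2,1)} + \psi_{i,(4)} (\nabla_x f)_{(4,1)} \\ 2 \theta_{i,(2)} x_2 + \psi_{i,(1)} (\nabla_x f)_{(1,2)} + \psi_{i,(4)} (\nabla_x f)_{(4,2)} \\ 2 \theta_{i,(3)} x_3 + \psi_{i,(2)} (\nabla_x f)_{(2,3)} + \psi_{i,(4)} (\nabla_x f)_{(4,3)} \\ 2 \theta_{i,(4)} x_4 + \psi_{i,(2)} (\nabla_x f)_{(2,4)} + \psi_{i,(3)} (\nabla_x f)_{(3,4)} + \psi_{i,(4)} (\nabla_x f)_{(4,4)} \end{bmatrix}, \tag{B.20}$$

where $(\nabla_x f)_{(r,c)}$, $r, c \in \{1, \dots, 4\}$, denote the elements of the matrix of partial derivatives

$$\nabla_x f = \begin{bmatrix} 0 & 1 & 0 & 0 \\ \dfrac{m_b r_b^2 x_4^2}{m_b r_b^2 + \Theta_b} & 0 & \dfrac{-g_e m_b r_b^2 \cos(x_3)}{m_b r_b^2 + \Theta_b} & \dfrac{2 m_b r_b^2 x_1 x_4}{m_b r_b^2 + \Theta_b} \\ 0 & 0 & 0 & 1 \\ (\nabla_x f)_{(4,1)} & -\dfrac{2 m_b x_1 x_4}{z} & \dfrac{g_e m_b x_1 \sin(x_3)}{z} & -\dfrac{2 m_b x_1 x_2}{z} \end{bmatrix} \tag{B.21}$$

with

$$(\nabla_x f)_{(4,1)} = \frac{-2 m_b x_2 x_4 - g_e m_b \cos(x_3)}{z} - \frac{2 m_b x_1 \left( u_1 + u_2 - 2 m_b x_1 x_2 x_4 - g_e m_b x_1 \cos(x_3) \right)}{z^2},$$

$$z = m_b x_1^2 + \Theta_w.$$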


Following the procedure described in Section B.3, we insert (B.19) in (B.16) and obtain the
system dynamics

$$\dot{x} = \begin{bmatrix} x_2 \\ \dfrac{m_b r_b^2 \left( x_1 x_4^2 - g_e \sin(x_3) \right)}{\Theta_b + m_b r_b^2} \\ x_4 \\ \tilde{f}_{(4)} \end{bmatrix}, \tag{B.22}$$

where

$$\tilde{f}_{(4)} = \frac{\left( -4 \theta_{1,(5)} \theta_{2,(5)} x_2 x_4 - 2 \theta_{1,(5)} \theta_{2,(5)} g_e \cos(x_3) \right) m_b x_1 z - \psi_{1,(4)} \theta_{2,(5)} - \psi_{2,(4)} \theta_{1,(5)}}{2 \theta_{1,(5)} \theta_{2,(5)} z^2}. \tag{B.23}$$

Furthermore, we insert (B.19) in (B.20) and obtain the same costate differential equation, yet
with

$$(\nabla_x f)_{(4,1)} = \frac{-2 m_b x_2 x_4 - g_e m_b \cos(x_3)}{z} - \frac{2 m_b x_1 \tilde{f}_{(4)}}{z}. \tag{B.24}$$

The system dynamics (B.22) and the differential equations of $\psi_1$ and $\psi_2$ defined by (B.20)
and (B.24) constitute a TPBVP which can be solved numerically. In this thesis, the MATLAB
function bvp4c is used which applies a collocation method (see [SKR00]).


B.5 Approximations for the Maximum Entropy 
Probability Density Function 


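This section presents the steps needed for the approximation result of the probability density
function given in (6.47). For brevity, the subscript $i$ is omitted from all variables related to
player $i$ in the following. Likewise, the following derivations are based on the assumption
that a single demonstration ($n_l = 1$) is available, such that the subscript $l$ can also be
neglected.


Inserting (6.44) in (6.24) results in

$$\begin{aligned} p(\tilde{u} \mid \theta) &= p(\tilde{u} \mid \tilde{u}_{\neg i}, x^{(1)}, \theta) \\ &= e^{-J(\tilde{u})} \left[ \int e^{-J(u)} \, \mathrm{d}u \right]^{-1} \\ &\approx e^{-J(\tilde{u})} \left[ \int e^{-J(\tilde{u}) - (u - \tilde{u})^\top g - \frac{1}{2} (u - \tilde{u})^\top G (u - \tilde{u})} \, \mathrm{d}u \right]^{-1} \\ &= \left[ \int e^{-(u - \tilde{u})^\top g - \frac{1}{2} (u - \tilde{u})^\top G (u - \tilde{u})} \, \mathrm{d}u \right]^{-1}. \end{aligned} \tag{B.25}$$

We note that the relation

$$(u - \tilde{u})^\top g + \frac{1}{2} (u - \tilde{u})^\top G (u - \tilde{u}) = \frac{1}{2} \left( u - \tilde{u} + G^{-1} g \right)^\top G \left( u - \tilde{u} + G^{-1} g \right) - \frac{1}{2} g^\top G^{-1} g \tag{B.26}$$

holds due to the symmetry of the second derivative $G$ of the cost function. By applying (B.26)
in (B.25), the right hand side results in

$$e^{-\frac{1}{2} g^\top G^{-1} g} \left[ \int e^{-\frac{1}{2} \left( u - \tilde{u} + G^{-1} g \right)^\top G \left( u - \tilde{u} + G^{-1} g \right)} \, \mathrm{d}u \right]^{-1}. \tag{B.27}$$

Finally, since

$$\int \frac{1}{(2\pi)^{\frac{\dim(u)}{2}} \det(\Sigma_u)^{\frac{1}{2}}} \, e^{-\frac{1}{2} (u - \mu_u)^\top \Sigma_u^{-1} (u - \mu_u)} \, \mathrm{d}u = 1 \tag{B.28}$$

holds for a multidimensional Gaussian distribution with the mean $\mu_u$ and the covariance
matrix $\Sigma_u$, we may rewrite (B.27) and obtain the approximated probability density function

$$p(\tilde{u} \mid \theta) \approx e^{-\frac{1}{2} g^\top G^{-1} g} \det(G)^{\frac{1}{2}} (2\pi)^{-\frac{\dim(u)}{2}}, \tag{B.29}$$

where $\dim(u) = m k_E$ denotes the dimension of $u$. From (B.29), the approximated
log-likelihood function results in

$$\ln \mathcal{L}(\tilde{u} \mid \theta) = \ln \left( p(\tilde{u} \mid \tilde{u}_{\neg i}, x^{(1)}, \theta) \right) \approx -\frac{1}{2} g^\top G^{-1} g + \frac{1}{2} \ln(\det(G)) - \frac{1}{2} \dim(u) \ln(2\pi). \tag{B.30}$$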

B.6 Implementation of the Direct Bilevel Approach 


The DB approach used for comparison in this thesis is based on the minimization of the cost functional (7.1), which depends on the current candidate trajectories $u_{\theta,i}(t)$ and $x_\theta(t)$. These trajectories must be Nash equilibrium trajectories under an arbitrary parametrization $\theta$ of the cost functions. The solution of a forward dynamic game with the parameters $\theta$ is therefore nested inside the objective function in (7.1). Consequently, each objective function evaluation includes the solution of a forward dynamic game to determine an OLNE or a FNE, depending on the considered case. We note that the search for $\theta$ might lead to cost function parameter candidates for which a Nash equilibrium does not necessarily exist. Proving the existence of Nash equilibria is in general not trivial. For example, in the case of linear-quadratic differential games, the existence of Nash equilibria depends on the existence of the solution to the coupled Riccati differential equations, yet its existence has only been proved under strong assumptions. Furthermore, the proofs are not very useful for practical implementation. Therefore, existence of Nash equilibria cannot be ensured by introducing optimization constraints. Nevertheless, probably inspired by the optimal control case (cf. assumptions in the results summarized in [Kué73]), literature on (linear-quadratic) dynamic games usually introduces constraints of the kind


$$ \mathcal{C} = \left\{ \theta_i \mid \theta_{i,(j)} \geq 0, \; \forall i \in \mathcal{P}, \; j \in \{1, \ldots, M_i\} \right\}. \tag{B.31} $$


This constraint set was implemented in the minimization of the objective function for the DB approach. The occurrence of successful calculations of Nash equilibria was indeed increased with this set. Nevertheless, it was not enough to completely avoid failure. Therefore, the objective function was augmented by a resetting procedure of the candidate trajectories (potentially leading to greater costs) which became active if the forward problem, i.e. the numerical solution of the corresponding RDEs or the TPBVPs, did not converge.


The algorithm describing the cost functional to be evaluated in each iteration of the opti- 
mization problem is listed below. 
Algorithm 5 Cost Functional for the Direct Bilevel Approach in Inverse Differential Games.

Input: Parameter candidates $\theta$, observed trajectory set $\mathcal{D}$, dynamics $f$, basis functions $\phi_i$.
Output: Sum of squared errors $J_{DB}$.
1: Attempt calculation of Nash equilibrium trajectories $x_\theta(t)$ and $u_{\theta,i}(t)$.
2: if Calculation fails then
3:     Set $x_\theta(t) = 0$ and $u_{\theta,i}(t) = 0$, $\forall i \in \mathcal{P}$.
4: end if
5: Calculate sum of squared errors $J_{DB}$ between candidate trajectories and observed trajectories.
6: return $J_{DB}$.


Therefore, the DB method used for the simulation results of Chapter 7 consists of the mini- 
mization of the cost functional described by Algorithm 5 subject to the constraints (B.31). 
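The steps of Algorithm 5 can be sketched as follows. This is only an illustration under stated assumptions: `solve_forward_game` is a hypothetical callable standing in for the numerical solution of the forward game, it is assumed to signal non-convergence with a `RuntimeError`, and `toy_solver` is a made-up stand-in used only to exercise the function.

```python
def db_cost(theta, observed, solve_forward_game):
    """Sum of squared errors between candidate and observed trajectories.

    `observed` is a list of samples, each a list of floats (states and
    controls stacked). `solve_forward_game` is a hypothetical solver.
    """
    try:
        candidate = solve_forward_game(theta)
    except RuntimeError:
        # Resetting procedure: fall back to all-zero candidate trajectories,
        # which typically yields a larger (but finite) cost.
        candidate = [[0.0] * len(sample) for sample in observed]
    return sum((c - o) ** 2
               for c_row, o_row in zip(candidate, observed)
               for c, o in zip(c_row, o_row))


def toy_solver(theta):
    """Made-up forward 'game': fails for negative parameters."""
    if any(t < 0.0 for t in theta):
        raise RuntimeError("forward problem did not converge")
    return [[theta[0] * k, theta[1]] for k in range(3)]


observed = toy_solver([1.0, 2.0])
```

An outer optimizer would minimize `db_cost` over `theta` subject to the nonnegativity constraints of (B.31); the reset keeps the objective defined even when the forward problem fails.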


B.7 Solutions of the LQ Tracking Problem in the 
Cooperative Steering Model 


This section presents reformulations of the LQ tracking problem arising in Section 8.2.1 to 
a standard LQ problem which allows an easier solution of the differential game. First, the 
general approach is presented. It is based on the reformulation proposed for the single-player 
case in [ML14]. Afterwards, the reformulations specific to the problem of Section 8.2.1 are 
shown. In the remainder of this section, time dependencies of all variables will be omitted 
for better readability. 


B.7.1 General Reformulation to a Standard LQ Problem 


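To begin the reformulation, the state variable $X = [x^\top \; z^\top]^\top$ is introduced, which combines the system states $x$ and the corresponding reference trajectories $z$. With this new state, we define an extended system including the original system dynamics as well as the reference model dynamics:

$$ \dot{X} = \tilde{A} X + \tilde{B}_1 u_1 + \tilde{B}_2 u_2 \quad \text{with} \quad \tilde{A} = \begin{bmatrix} A & 0 \\ 0 & H \end{bmatrix}, \quad \tilde{B}_i = \begin{bmatrix} B_i \\ 0 \end{bmatrix}, \quad i \in \{1, 2\}. \tag{B.32} $$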
Due to the infinite horizon, the cost function (8.5) can only be applied if $H$ is Hurwitz. This is a considerable restriction, since application-relevant reference signals, e.g. sinusoidal and step functions, will not lead to a Hurwitz reference system matrix. In order to circumvent this problem, we introduce a discount factor $\beta$ such that $0 < \beta < 1$ in the cost function, thus avoiding infinite costs.


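We note that the tracking error $e$ can be written as $e = TX$, where $T = [I_n \; -I_n]$ and $I_n$ is an $n$-dimensional identity matrix. With this transformation matrix and with the discount factor $\beta$, we rewrite (8.5) as

$$ J_i = \int_0^\infty \exp(-\beta t) \left( X^\top T^\top Q_i T X + u_i^\top R_{ii} u_i \right) \mathrm{d}t = \int_0^\infty \exp(-\beta t) \left( X^\top \tilde{Q}_i X + u_i^\top R_{ii} u_i \right) \mathrm{d}t, \tag{B.33} $$

where

$$ \tilde{Q}_i = T^\top Q_i T. \tag{B.34} $$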


According to Modares and Lewis [ML14], the optimal control problem consisting of the system dynamics (B.32) and the cost function (B.33), to be minimized for any $i \in \{1, 2\}$, can be reformulated as an optimal control problem with a cost function without any discount factor $\beta$, but with the new system matrix $\tilde{A} - 0.5 \beta I$ instead of the one in (B.32). This is necessary in order to ease the calculation of the solution and to prove its existence. In their paper [ML14], Modares and Lewis state that the solution exists if the matrix $\tilde{A} - 0.5 \beta I$ is Hurwitz.
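The shift of the system matrix can be made plausible by a change of variables. The following short derivation is a sketch, not a reproduction of the argument in [ML14]; the barred variables are introduced here only for illustration, $\tilde{A}$ and $\tilde{B}_i$ denote the extended matrices of (B.32), and $\tilde{Q}_i = T^\top Q_i T$.

```latex
\begin{align*}
\bar{X}(t) &:= e^{-\beta t/2} X(t), \qquad \bar{u}_i(t) := e^{-\beta t/2} u_i(t), \\
\dot{\bar{X}} &= -\tfrac{\beta}{2}\,\bar{X}
   + e^{-\beta t/2}\bigl(\tilde{A} X + \tilde{B}_1 u_1 + \tilde{B}_2 u_2\bigr)
   = \bigl(\tilde{A} - 0.5\,\beta I\bigr)\bar{X} + \tilde{B}_1 \bar{u}_1 + \tilde{B}_2 \bar{u}_2, \\
e^{-\beta t}\bigl(X^\top \tilde{Q}_i X + u_i^\top R_{ii} u_i\bigr)
  &= \bar{X}^\top \tilde{Q}_i \bar{X} + \bar{u}_i^\top R_{ii} \bar{u}_i .
\end{align*}
```

In the barred variables the cost is undiscounted and the dynamics carry the shifted matrix, and a linear feedback $u_i = -K_i X$ corresponds to $\bar{u}_i = -K_i \bar{X}$ with the same gain.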


B.7.2 Transformed System Dynamics and Cost Functions of the 
Cooperative Steering System 


Given that we apply constant reference values, $H = 0$ holds for the reference system matrix in (8.4). Moreover, the velocity reference signal is zero. Therefore, we neglect this term before applying the aforementioned transformation. In this way, we obtain system dynamics of the form (B.32) with the extended state $X = [\dot{\varphi} \; \varphi \; \varphi_{\mathrm{ref}}]^\top$. This leads to a transformed system
(B.32) with


de Ce 1 

~ E Osum m Osum ~ ~ Osum 

A = 1 0 0 y Bı = By = 0 N (B.35) 
0 0 0 0 


za 2 "| (B.36) 
Since our steering wheel system is stabilizable and the reference system with $H = 0$ is marginally stable, any $\beta > 0$ suffices to make the extended system stabilizable and, consequently, to make the transformation applicable. We choose a small value of $\beta = 0.01$, leading to a modified cost function


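$$ J_i = \int_0^\infty \exp(-\beta t) \left( X^\top \tilde{Q}_i X + R_i u_i^2 \right) \mathrm{d}t, \tag{B.37} $$

where

$$ \tilde{Q}_i = T^\top Q_i T = \begin{bmatrix} q_1 & 0 & 0 \\ 0 & q_2 & -q_2 \\ 0 & -q_2 & q_2 \end{bmatrix}. \tag{B.38} $$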


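Finally, we obtain a standard LQ differential game consisting of the system dynamics matrices $(\tilde{A} - 0.5 \beta I, \tilde{B}_1, \tilde{B}_2)$ and the cost functions

$$ J_i = \int_0^\infty \left( X^\top \tilde{Q}_i X + R_i u_i^2 \right) \mathrm{d}t. \tag{B.39} $$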


For the solution of the inverse LQ dynamic game, parameter constraints are introduced in 
the corresponding optimization problems (constituting the IOC, IRL and DB approaches) such 
that the structure of the cost function matrix in (B.38) is ensured. 
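The structural constraint on the extended weighting matrix can be illustrated numerically. The sketch below assumes a diagonal $Q_i = \mathrm{diag}(q_1, q_2)$ and a transformation matrix $T$ that maps the extended state to the angular velocity and the angle tracking error; both are assumptions made for this illustration, and the helper names are hypothetical.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# Assumed transformation: e = T X with X = [angular velocity, angle, reference].
T = [[1, 0, 0],
     [0, 1, -1]]


def extended_weight(q1, q2):
    """Return T^T Q T for the assumed diagonal Q = diag(q1, q2)."""
    Q = [[q1, 0], [0, q2]]
    return matmul(matmul(transpose(T), Q), T)
```

With these assumptions only two free parameters remain per player, and the lower-right block couples angle and reference with opposite signs, which is the pattern the parameter constraints have to preserve.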


C Supplementary Results on the Solution 
Sets for Inverse Linear-Quadratic 
Differential Games 


The following results complement the results of Section 5.3 to illustrate how the properties 
of an inverse LQ differential game are altered depending on the number of states, controls 
and players. 


All results are based on the general structure of a quadratic cost function given by 


$$ J_i(x_0, K, Q_i, R_{ij}) = \frac{1}{2} \int_0^\infty \left( x^\top Q_i x + \sum_{j=1}^{N} u_j^\top R_{ij} u_j \right) \mathrm{d}t. $$
A two-player and a three-player inverse LQ differential game are considered exemplarily. 


Figures C.1 and C.2 show 3D maps for analyzing the dimensions of the matrix $\boldsymbol{M}_i$ for inverse LQ differential games with $N = 2$ and $N = 3$, respectively, with symmetric and diagonal cost function matrices and different numbers of states $n$ and controls $m_i$. These are analogous to Figure 5.1, which showed the case $N = 1$. The number of equations (rows of $\boldsymbol{M}_i$) and the number of parameters $M_i$ (columns of $\boldsymbol{M}_i$) are shown as a function of the number of states $n$ and the number of controls $m_i$.

In Figures C.1a and C.2a, the number of parameters $M_i$ is always greater than the number of equations $n m_i$ such that the solution set of player $i$ is at least one-dimensional. In Figures C.1b and C.2b, we observe that there are combinations of $n$ and $m_i$ which lead to $n m_i > M_i$. The black line denotes the cases where $n m_i = M_i - 1 < M_i$, which shows that the kernel is guaranteed to exist and is one-dimensional. Therefore, from this line to the left, the solution set of player $i$ can be expressed by $\ker(\boldsymbol{M}_i)$, while the area on the right side of the line represents the cases where solutions may be found by applying the results of Theorem 5.3.
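The comparison between equations and parameters can be reproduced with a few lines of code. The sketch below rests on an assumption that is not stated in this appendix: the parameter count $M_i$ is taken as the number of free entries of $Q_i$ plus those of all $R_{ij}$, $j = 1, \ldots, N$. The function names are hypothetical.

```python
def n_parameters(n, controls, structure):
    """Assumed parameter count M_i for one player.

    `controls` lists the control dimensions m_j of all N players.
    """
    if structure == "symmetric":
        return n * (n + 1) // 2 + sum(m * (m + 1) // 2 for m in controls)
    if structure == "diagonal":
        return n + sum(controls)
    raise ValueError("structure must be 'symmetric' or 'diagonal'")


def n_equations(n, m_i):
    """Number of equations (rows of the matrix) for player i."""
    return n * m_i


def kernel_guaranteed(n, controls, i, structure):
    """True if there are fewer equations than parameters for player i."""
    return n_equations(n, controls[i]) < n_parameters(n, controls, structure)
```

Under this counting, symmetric matrices always give more parameters than equations, whereas diagonal matrices do not (for example $n = 5$, $m_i = 2$, $N = 2$ gives ten equations and nine parameters), which is in line with the qualitative behavior described for the figures.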
Figure C.1: Number of parameters $M_i$ and equations $n m_i$ depending on the number of states and controls in a two-player inverse LQ differential game: (a) symmetric cost function matrices, (b) diagonal cost function matrices. The red thick line denotes the case where $n m_i = M_i - 1$.

Figure C.2: Number of parameters $M_i$ and equations $n m_i$ depending on the number of states and controls in a 3-player inverse LQ differential game: (a) symmetric cost function matrices, (b) diagonal cost function matrices. The red thick line denotes the case where $n m_i = M_i - 1$.


D Inverse Cooperative Dynamic Games 
Based on Maximum Entropy Inverse 
Reinforcement Learning 


In this chapter, the probability density function given by (6.17) is leveraged to develop a method which identifies cost function parameters from a Pareto efficient solution of the dynamic game. Similar to the results of Chapter 6, the unbiasedness of the estimation is proved.


D.1 Preliminaries 


In this appendix, Pareto efficient solutions are considered which can be described by a global cost function given by the sum of weighted player cost functions. Several global cost functions are possible depending on the selected weighting parameters to build the sum (cf. Section 3.6.3). One particular global cost function is given by the sum of uniformly weighted player cost functions defined as follows.


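Definition D.1 (Global Cost Function as Uniformly Weighted Sum)
The uniformly weighted sum of all player cost functions is given by
$$ J_\Sigma = \sum_{i=1}^{N} J_i = \sum_{i=1}^{N} \theta_i^\top \mu_i =: \theta_\Sigma^\top \mu_\Sigma \tag{D.1} $$
with
$$ \theta_\Sigma = \left[ \theta_1^\top \; \ldots \; \theta_N^\top \right]^\top \tag{D.2a} $$
and
$$ \mu_\Sigma = \left[ \mu_1^\top \; \ldots \; \mu_N^\top \right]^\top. \tag{D.2b} $$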

The following assumption is introduced in order to be able to obtain Pareto efficient solutions.
Assumption D.1 (Convexity of the Global Cost Function)

The cost functions $J_i$ are convex for all $i \in \mathcal{P}$.


Remark D.1:
It can be noted that
$$ \arg\min_{\gamma} J_\Sigma(\gamma) = \arg\min_{\gamma} \sum_{i=1}^{N} \frac{1}{N} J_i(\gamma), \tag{D.3} $$
where $\gamma := \{\gamma_1, \ldots, \gamma_N\}$, holds since multiplying any cost function $J_\Sigma$ with a constant factor $c \in \mathbb{R}^+$ (here $1/N$) does not alter the solution of the optimization problem. Therefore, under Assumption D.1 and with the results of Theorem 3.3, the minimizer of $J_\Sigma$ describes a Pareto efficient solution of a cooperative game.


D.2 Identification Method and Unbiasedness of the 
Estimation 


Sections 6.4 and 6.5 presented how to find cost function parameters which explain observed trajectories arising from a noncooperative game with OL and FB Nash equilibrium strategies. This was done by means of an MLE based on a probability density function. This section presents a similar procedure such that parameters can be found which explain trajectories corresponding to a cooperative game with equally weighted cost functions as in Definition D.1.

The inverse dynamic game method is based on the density (6.17), which naturally arises with the maximum entropy principle as described in Section 6.3. The first step consists in rewriting (6.17) using (D.1) and (D.2), leading to


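$$ p(\zeta \mid \theta_\Sigma) = \frac{\exp\left(-\theta_\Sigma^\top \mu_\Sigma(\zeta)\right)}{\int \exp\left(-\theta_\Sigma^\top \mu_\Sigma(\tilde{\zeta})\right) \mathrm{d}\tilde{\zeta}} = \frac{\exp\left(-J_\Sigma(\zeta)\right)}{\int \exp\left(-J_\Sigma(\tilde{\zeta})\right) \mathrm{d}\tilde{\zeta}}. \tag{D.4} $$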


This allows the definition of a likelihood function analogous to the one introduced in Definition 6.5. In this case, we denote the likelihood as

$$ \mathcal{L}(\theta_\Sigma \mid \mathcal{D}) = \prod_{l=1}^{n_t} p(\hat{\zeta}_l \mid \theta_\Sigma), \tag{D.5} $$

where $p(\hat{\zeta}_l \mid \theta_\Sigma)$ is obtained by evaluating (D.4) at $\hat{\zeta}_l$, $l \in \{1, \ldots, n_t\}$.


The following theorem represents the main result concerning the identification of cost func- 
tions in an inverse cooperative dynamic game with Pareto efficient solutions. 


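Theorem D.1 (Unbiasedness of the Identification of Pareto Efficient Solutions)

Let $n_t$ trajectories $\mathcal{D} = \{\hat{\zeta}_1, \ldots, \hat{\zeta}_{n_t}\}$ fulfilling Assumption 6.1 be available. Then, the MLE with respect to the observed trajectories, i.e.

$$ \hat{\theta}_\Sigma = \arg\max_{\theta_\Sigma} \ln \mathcal{L}\{\theta_\Sigma \mid \mathcal{D}\}, \tag{D.6} $$

where $\mathcal{L}\{\theta_\Sigma \mid \mathcal{D}\}$ is obtained by evaluating the likelihood function (D.5) at $\hat{\zeta}_l$, $l \in \{1, \ldots, n_t\}$, leads to parameters $\hat{\theta}_\Sigma$ such that the resulting probability density function $p(\zeta \mid \hat{\theta}_\Sigma)$ leads to an expectation of the cost function values $J_\Sigma(\zeta, \theta_\Sigma^*)$ which is equal to the one corresponding to the probability density function with original parameters $p(\zeta \mid \theta_\Sigma^*)$, i.e.

$$ \mathrm{E}_{p(\zeta \mid \hat{\theta}_\Sigma)} \left\{ J_\Sigma(\zeta, \theta_\Sigma^*) \right\} = \mathrm{E}_{p(\zeta \mid \theta_\Sigma^*)} \left\{ J_\Sigma(\zeta, \theta_\Sigma^*) \right\}. \tag{D.7} $$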


Proof: 
The proof is analogous to the proof of Theorem 6.1. 


The results of Theorem D.1 imply that the expectation of the global costs (under the original 
parameters) produced by trajectories generated by the probability density functions with 
original and estimated parameters are equal. Note that this result is weaker than the one 
required in (6.7) as it considers only the overall costs. Nevertheless, for a cooperative game, 
it is enough to describe observed trajectories completely. 


Remark D.2: 


Similar to the results of Chapter 6, solving the optimization problem (D.6) demands the possibility of evaluating the likelihood function $\mathcal{L}\{\theta_\Sigma \mid \mathcal{D}\}$ and therefore the probability density function (D.4) at the trajectories $\hat{\zeta}_l$, $l \in \{1, \ldots, n_t\}$. The denominator in (D.4) includes an integral over all trajectories $\tilde{\zeta}$ which are feasible with respect to the system dynamics and an initial state. An approach analogous to the one presented in Section 6.6 can be applied in this case.
Remark D.3:

The result $\hat{\theta}_\Sigma$ of (D.6) contains the cost function parameters of all players in one single vector according to (D.2). Assuming that the number of features $M_i$ is known for every player $i \in \mathcal{P}$, an individual parameter set $\hat{\theta}_i$ can be determined from $\hat{\theta}_\Sigma$ by means of (D.2a). This is done by using the relation

$$ \hat{\theta}_i = \hat{\theta}_\Sigma(I^s : I^f) \tag{D.8} $$

with

$$ I^s = 1 + \sum_{a=1}^{i} M_{a-1} \quad \text{and} \quad I^f = \sum_{a=1}^{i} M_a \tag{D.9} $$

with $M_0 = 0$ and where $\hat{\theta}_\Sigma(I^s : I^f) \in \mathbb{R}^{I^f - I^s + 1}$ denotes a vector that contains the entries $I^s$ to $I^f$ of the vector $\hat{\theta}_\Sigma$.
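The index bookkeeping of (D.8) and (D.9) amounts to cutting the stacked parameter vector into consecutive blocks. A minimal sketch, assuming the feature counts of all players are known and using 1-based indices as in the remark (the function name is hypothetical):

```python
def split_parameters(theta_sigma, feature_counts):
    """Split the stacked vector into one parameter list per player.

    `feature_counts` holds M_1, ..., M_N; indices follow (D.9) and are
    converted to Python's 0-based slicing.
    """
    parts = []
    end = 0
    for m in feature_counts:
        start = end + 1          # I^s = 1 + M_0 + ... + M_{i-1}
        end = end + m            # I^f = M_1 + ... + M_i
        parts.append(theta_sigma[start - 1:end])
    return parts
```

For example, a stacked vector with six entries and feature counts 2, 1 and 3 is split into blocks of exactly those lengths, in player order.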


The presented method is capable of identifying cost function parameters which explain trajectories corresponding to an optimal solution based on uniformly weighted player cost functions, which is one of the Pareto efficient solutions belonging to the Pareto frontier. Pareto efficient solutions can also be obtained by minimizing a sum of the cost functions of all players in which the weights are not necessarily equal (see Definition 3.9). In this scenario, the presented method would not be able to estimate the original parameters $\theta_i^*$, but would be able to determine parameters $\hat{\theta}_i$ which are also capable of describing the trajectories. A simulation example where the effectiveness of the presented inverse dynamic game method is demonstrated can be found in [IBKH20].


E Supplementary Simulation Results 


This chapter gives supplementary results of the simulative evaluation of the inverse dynamic 
game methods performed in Chapter 7. 


E.1 Inverse Nonlinear Open-Loop Dynamic Game 


E.1.1 Robustness to Measurement Noise 


Figures E.1 to E.4 show the trajectory estimation results for different SNR values of the observed trajectory used for the inverse dynamic game methods. The estimated trajectories are determined by solving the dynamic game with the parameters $\hat{\theta}_i$, $i \in \mathcal{P}$, identified by each of the considered methods, i.e. the parameters from Tables 7.3, 7.5 and 7.7 are used. The noise-free case is presented in Figure 7.4 and the 30 dB results are shown in Figure 7.8.
Figure E.1: Observed trajectories and estimations based on mean identification results of all methods, SNR = 20 dB

Figure E.2: Observed trajectories and estimations based on mean identification results of all methods, SNR = 25 dB

Figure E.3: Observed trajectories and estimations based on mean identification results of all methods, SNR = 35 dB

Figure E.4: Observed trajectories and estimations based on mean identification results of all methods, SNR = 40 dB


E.1.2 Robustness to Basis-Function Mismatch 


Figures E.5 and E.6 show the comparison of the trajectories which result from the dynamic games solved with the parameters $\hat{\theta}_i$, $i \in \mathcal{P}$, identified by each of the considered methods. The identification is based on observed trajectories generated in Section 7.4.1 and different basis functions (cases II and III) as given in Table 7.9.


E.2 Inverse LQ Feedback Differential Game 


E.2.1 Robustness to Measurement Noise 


The following Tables E.1 to E.6 list the mean values of the identified parameters corresponding to the matrices $Q_i$ and $R_{ij}$, $i \in \mathcal{P}$, over all 100 identification procedures conducted in Section 7.5.3, where the robustness of the inverse dynamic game methods to measurement noise is evaluated.

The following Figures E.7 to E.10 show the comparison of the trajectories which result from the dynamic games solved with the mean of the parameters $\hat{\theta}_i$, $i \in \mathcal{P}$, identified by each of the considered methods and based on the observed trajectories with different SNR values. The noise-free case is presented in Figures 7.11 and 7.12 and the 20 dB results are shown in Figures

Figure E.5: Inverse open-loop dynamic game results for all methods in the basis function mismatch case II.

Figure E.6: Observed trajectories and estimations based on identification results of all methods in the basis function mismatch case III.
Table E.1: Mean values of the cost function matrices $Q_i$ identified with IOC

SNR in dB   Q_1, mean                   Q_2, mean                   Q_3, mean
20          (0.88, 0.62, 1.28, 0.65)    (0.80, 0.70, 1.02, 1.11)    (1.29, 0.74, 0.60, 0.55)
25          (0.95, 0.50, 1.69, 0.85)    (0.80, 0.68, 1.01, 1.65)    (1.29, 0.82, 0.52, 0.77)
30          (0.98, 0.43, 1.91, 0.95)    (0.82, 0.68, 1.00, 1.88)    (1.11, 0.93, 0.51, 0.91)
35          (1.00, 0.41, 1.98, 0.98)    (0.85, 0.67, 1.00, 1.96)    (1.04, 0.98, 0.50, 0.97)
40          (1.00, 0.40, 2.00, 0.99)    (0.93, 0.63, 1.00, 1.98)    (1.01, 1.00, 0.50, 0.99)
∞           (1.00, 0.40, 2.00, 1.00)    (1.00, 0.60, 1.00, 2.00)    (1.00, 1.00, 0.50, 1.00)


Table E.2: Mean values of the cost function matrices $R_{ij}$ identified with IOC

SNR in dB   R_1,(22), mean   R_2,(22), mean   R_3,(22), mean
20          0.99             0.85             1.97
25          1.00             0.95             1.97
30          1.01             0.99             1.99
35          1.01             0.99             1.99
40          1.00             1.00             2.00
∞           1.00             1.00             2.00


Table E.3: Mean values of the cost function matrices $Q_i$ identified with IRL

SNR in dB   Q_1, mean                   Q_2, mean                   Q_3, mean
20          (0.98, 0.20, 2.06, 0.94)    (0.93, 0.63, 1.00, 2.12)    (1.09, 0.95, 0.51, 0.90)
25          (0.99, 0.32, 2.02, 0.98)    (0.96, 0.62, 1.00, 2.07)    (1.02, 0.99, 0.51, 0.96)
30          (1.00, 0.38, 2.00, 0.99)    (0.96, 0.62, 1.00, 2.02)    (1.00, 1.00, 0.50, 0.99)
35          (1.00, 0.39, 2.00, 1.00)    (0.99, 0.61, 1.00, 2.00)    (1.00, 1.00, 0.50, 1.00)
40          (1.00, 0.40, 2.00, 1.00)    (1.00, 0.60, 1.00, 2.00)    (1.00, 1.00, 0.50, 1.00)
∞           (1.00, 0.40, 2.00, 1.00)    (1.00, 0.60, 1.00, 2.00)    (0.99, 1.00, 0.50, 1.00)


Table E.4: Mean values of the cost function matrices $R_{ij}$ identified with IRL

SNR in dB   R_1,(22), mean   R_2,(22), mean   R_3,(22), mean
20          0.98             1.11             2.01
25          0.99             1.05             2.01
30          1.00             1.02             2.00
35          1.00             1.00             2.00
40          1.00             1.00             2.00
∞           1.00             1.00             2.00


7.15 and 7.16. As can be inferred from Figures E.9 and E.10, the trajectory comparison for the cases 35 dB and 40 dB yields no visually recognizable improvement. These are not
Table E.5: Mean values of the cost function matrices $Q_i$ identified with the DB method

SNR in dB   Q_1, mean                   Q_2, mean                   Q_3, mean
20          (1.17, 0.37, 2.22, 1.01)    (1.32, 0.47, 0.99, 2.02)    (1.16, 0.92, 0.57, 1.46)
25          (1.08, 0.37, 2.15, 1.00)    (1.25, 0.49, 1.00, 2.02)    (1.10, 0.95, 0.53, 1.22)
30          (1.05, 0.39, 2.08, 1.00)    (1.20, 0.51, 1.00, 2.01)    (1.12, 0.94, 0.52, 1.13)
35          (1.03, 0.39, 2.04, 1.00)    (1.16, 0.53, 1.00, 2.02)    (1.08, 0.96, 0.51, 1.06)
40          (1.01, 0.40, 2.02, 1.00)    (1.09, 0.56, 1.00, 2.00)    (1.05, 0.97, 0.50, 1.04)
∞           (1.00, 0.40, 2.00, 1.00)    (1.04, 0.58, 1.00, 2.00)    (1.02, 0.99, 0.50, 1.00)


Table E.6: Mean values of the cost function matrices $R_{ij}$ identified with the DB method

SNR in dB   R_1,(22), mean   R_2,(22), mean   R_3,(22), mean
20          1.55             0.98             2.16
25          1.18             0.99             2.01
30          1.10             1.00             2.04
35          1.15             1.00             2.02
40          1.03             1.00             2.01
∞           1.00             1.00             2.00


explicitly shown here as the results are practically identical to the noise-free case from Figures 7.11 and 7.12.


E.2.2 Robustness to Basis Function Mismatch 


Figures E.11 to E.16 show the comparison of the observed trajectories with the ones which result from the dynamic games solved with the parameters $\hat{\theta}_i$, $i \in \mathcal{P}$, identified by each of the considered methods. The identification is based on observed trajectories generated in Section 7.4.1 and incomplete basis functions (cases II to IV) as given in Table 7.17.


Figure E.7: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 25 dB.

Figure E.8: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 25 dB.

Figure E.9: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 30 dB.

Figure E.10: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 30 dB.

Figure E.11: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q_{i,(j,j)} = 0$ for $i \in \{1, 2\}$ and $j \in \{3, 4\}$ (case II).

Figure E.12: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q_{i,(j,j)} = 0$ for $i \in \{1, 2\}$ and $j \in \{3, 4\}$ (case II).

Figure E.13: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q_{i,(j,j)} = 0$ for $i \in \{1, 2\}$ and $j \in \{2, 3, 4\}$ (case III).

Figure E.14: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q_{i,(j,j)} = 0$ for $i \in \{1, 2\}$ and $j \in \{2, 3, 4\}$ (case III).

Figure E.15: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q_i = 0$ for $i \in \{1, 2\}$ (case IV).

Figure E.16: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q_i = 0$ for $i \in \{1, 2\}$ (case IV).


F Supplementary Results of the Application 
in Shared Control 


This section gives further information on the results of Chapter 8. 


F.1 Further Details on the Experimental Setup 


This section provides details on the parameters of the two steering wheels and on the developed control structure which realizes their virtual coupling.


F.1.1 Steering Wheel Parameters 


The following Table F.1 lists the parameters of the two steering wheels belonging to the 
cooperative steering system. 


Table F.1: Steering wheel parameters 


Parameter   SW1               SW2               Description
Θ_i         0.04 kg m²        0.054 kg m²       Rotational inertia
c_i         0.573 Nm/rad      0.573 Nm/rad      Spring constant
d_i         0.430 Nm·s/rad    0.430 Nm·s/rad    Damping constant


F.1.2 Steering Wheel Coupling Control 


The steering wheels were coupled using a control algorithm which emulates a spring-damper 
element between them. This kind of coupling was first presented in [LDFH14], where it was 
also used in a study for analyzing haptic interaction between humans. 


Figure F.1 shows the control loop of the cooperative steering system. The controller calculates a torque $M(t)$ which is equally distributed on each steering wheel. The aim is to regulate the difference $e_c(t) = e_{\mathrm{des}} - e_{\mathrm{meas}}(t)$ towards zero, where $e_{\mathrm{meas}}(t) = \varphi_1(t) - \varphi_2(t)$ is the difference of measured steering angles and $e_{\mathrm{des}} = 0$ is the desired angle difference.

Figure F.1: Control structure of the cooperative steering system


The controller $R(s)$ is designed as a proportional-derivative (PD) controller with a low-pass filter positioned before the derivative part in order to suppress measurement noise. The structure of the PD controller is illustrated in Figure F.2. The controller behavior is defined by the transfer function

$$ R(s) = \frac{M(s)}{E(s)} = \frac{s K_D}{1 + T_D s} + K_P, \tag{F.1} $$

where $M(s)$ and $E(s)$ denote the Laplace transforms of the torque $M(t)$ and the control error $e_c(t)$, respectively. The variables $K_P$ and $K_D$ denote the coefficients of the proportional and the derivative terms, respectively. The variable $T_D$ denotes the time constant of the first-order lag filter. The values of these parameters are given in Table F.2.
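The effect of the first-order lag on the derivative part can be checked by evaluating the transfer function on the imaginary axis. The sketch below assumes the form $R(s) = K_P + s K_D / (1 + T_D s)$ and reads the values of Table F.2 as $K_P = 1.96$, $K_D = 0.175$ and $T_D = 1.825$ ms; this assignment of the table rows and the function name are assumptions made for illustration.

```python
# Assumed reading of Table F.2 (units of the gains are not given there).
K_P = 1.96        # proportional coefficient
K_D = 0.175       # derivative coefficient
T_D = 1.825e-3    # time constant of the first-order lag in seconds


def controller_response(omega):
    """Frequency response R(j*omega) of the filtered PD controller."""
    s = 1j * omega
    return K_P + s * K_D / (1.0 + T_D * s)


# Corner frequency of the lag filter in rad/s.
corner_frequency = 1.0 / T_D
```

At low frequencies the response approaches the proportional gain, while at high frequencies the derivative contribution saturates at `K_D / T_D` instead of growing without bound, which is what limits the amplification of measurement noise.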


Figure F.2: PD controller used for the coupling of the steering wheels

Table F.2: PD controller parameters

Parameter   Value
K_P         1.96
K_D         0.175
T_D         1.825 ms


F.2 Supplementary Tables of the Shared Control Identification Results


The following Tables F.3 to F.5 give all trajectory estimation errors for all subject pairs which were obtained using the IOC, IRL and DB methods, respectively.


Table F.3: Cooperative steering experiment: Error between measured trajectories and trajectories obtained with the IOC method

Subject pair δ^x δ^u1 δ^u2
1 70.014 56.907 56.436 
2 127.429 60.718 85.649 
3 100.466 86.705 59.141 
4 133.885 105.515 78.556 
5 91.833 44.793 65.951 
6 81.696 44.930 71.280 
7 253.893 130.618 86.561 
8 111.529 71.462 74.479 
9 182.829 59.913 100.267 
10 129.138 97.584 72.642 
11 249.695 90.685 192.784 
12 271.043 111.514 126.208 
13 80.681 81.442 43.414 
14 142.491 99.035 88.994 
15 107.503 90.802 65.915 
16 196.169 68.491 98.087 
17 134.954 74.166 73.785 
18 123.916 90.849 89.037 
19 113.636 67.701 108.814 
20 126.179 77.609 110.219 
21 93.533 57.297 93.647 
22 109.570 197.063 141.135 
23 171.691 125.290 82.074 
24 178.529 73.403 67.797 
25 246.650 132.693 88.445 
mean 145.158 87.887 88.853 
median 127.429 81.442 85.649 
SD 58.632 33.530 30.747 
Table F.4: Cooperative steering experiment: Error between measured trajectories and trajectories obtained with the IRL method

Subject pair δ^x δ^u1 δ^u2
1 53.719 58.563 58.703 
2 101.236 100.596 90.611 
3 129.557 98.956 63.395 
4 84.566 104.799 75.475 
5 94.571 48.288 72.978 
6 99.932 50.992 69.750 
7 175.364 118.096 80.992 
8 104.913 76.127 93.537 
9 108.529 72.967 77.475 
10 77.535 70.975 66.250 
11 127.078 86.291 125.088 
12 190.018 102.359 97.408 
13 103.761 76.280 49.144 
14 82.052 118.415 100.529 
15 89.313 87.232 61.959 
16 193.824 60.500 112.702 
17 100.953 97.991 79.588 
18 87.889 93.470 96.903 
19 86.435 60.791 97.528 
20 108.915 72.372 103.536 
21 92.718 59.813 100.144 
22 85.351 218.646 149.104 
23 153.241 121.517 79.267 
24 141.748 93.011 62.485 
25 137.070 115.503 77.055 
mean 112.411 90.582 85.664
median 101.236 87.232 79.588
SD 35.611 34.569 22.678
Table F.5: Cooperative steering experiment: Error between experimentally measured trajectories and trajectories obtained with the DB approach

Subject pair δ^x δ^u1 δ^u2
1 44.224 57.090 55.853 
2 101.985 61.651 77.223 
3 73.849 67.249 53.884 
4 96.832 102.442 69.903 
5 77.125 66.479 67.224 
6 85.592 60.670 73.956 
7 116.169 105.058 78.649 
8 92.483 70.193 70.751 
9 91.977 62.021 78.398 
10 68.201 76.293 64.553 
11 89.672 80.611 120.724 
12 94.118 81.256 87.326 
13 58.457 76.280 38.762 
14 85.817 93.916 83.197 
15 83.445 85.230 59.801 
16 107.501 58.896 86.753 
17 99.877 70.717 70.081 
18 79.049 75.170 82.617 
19 73.561 54.234 89.718 
20 87.378 59.884 85.506 
21 74.819 53.756 83.710 
22 143.719 76.626 57.984 
23 95.018 107.129 73.908 
24 100.738 86.762 63.648 
25 95.582 114.574 71.243 
mean 88.711 76.168 73.815 
median 89.672 75.170 73.908
SD 19.372 17.459 15.723 
F.2.1 Statistical Test Results


The following Tables F.6 and F.7 give the p-values corresponding to the right-tailed Wilcoxon signed-rank test conducted on the data sets of NSAE of states and controls, respectively. In Table F.6, the null hypothesis is always rejected at a significance level of $\alpha = 0.01$. The right-tailed property leads to the validity of the alternative hypothesis, which states that $\delta^x_{\mathrm{median,row}} - \delta^x_{\mathrm{median,column}} > 0$. The same holds for Table F.7 with the exception of the NSAE of the controls obtained by the IOC and IRL methods. For this pair, the hypothesis $H_0$ cannot be rejected and thus their difference is not statistically significant.
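A right-tailed signed-rank test of this kind can be sketched in a few lines. The code below computes the exact p-value by enumerating the rank-sum distribution and assumes there are no ties or zero differences beyond dropping exact zeros. The data are the first error columns of Tables F.3 (IOC) and F.5 (DB), which are read here as the state errors; that reading, the use of the exact distribution and all names are assumptions of this sketch rather than a description of the original evaluation.

```python
def wilcoxon_right_tailed(x, y):
    """Exact right-tailed Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value) for the alternative 'x - y has median > 0'.
    Assumes no ties among the absolute differences.
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
    # counts[s]: number of subsets of the ranks {1, ..., n} with sum s.
    total = n * (n + 1) // 2
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return w_plus, sum(counts[w_plus:]) / 2 ** n


ioc_state_error = [70.014, 127.429, 100.466, 133.885, 91.833, 81.696, 253.893,
                   111.529, 182.829, 129.138, 249.695, 271.043, 80.681, 142.491,
                   107.503, 196.169, 134.954, 123.916, 113.636, 126.179, 93.533,
                   109.570, 171.691, 178.529, 246.650]
db_state_error = [44.224, 101.985, 73.849, 96.832, 77.125, 85.592, 116.169,
                  92.483, 91.977, 68.201, 89.672, 94.118, 58.457, 85.817,
                  83.445, 107.501, 99.877, 79.049, 73.561, 87.378, 74.819,
                  143.719, 95.018, 100.738, 95.582]

w_plus, p_value = wilcoxon_right_tailed(ioc_state_error, db_state_error)
```

For these 25 pairs only two differences are negative, so the p-value lies far below the significance level of 0.01 used in this section.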


Table F.6: p-values of the Wilcoxon signed-rank test with $H_0$: "$\delta^x_{\mathrm{median,row}} - \delta^x_{\mathrm{median,column}}$ comes from a distribution with zero median".

        IOC   IRL             DB
IOC     -     1.249 · 10^-4   1.639 · 10^-6
IRL     -     -               1.085 · 10^-4


Table F.7: p-values of the Wilcoxon signed-rank test with $H_0$: "$\delta^u_{\mathrm{median,row}} - \delta^u_{\mathrm{median,column}}$ comes from a distribution with zero median" (* denotes the failure of hypothesis rejection).

        IOC   IRL      DB
IOC     -     0.594*   6.995 · 10^-7
IRL - - 1.597 - 107" 